DEVELOPMENT OF AN INSTRUMENT FOR
STUDYING VERBAL BEHAVIORS IN A
SECONDARY SCHOOL MATHEMATICS
CLASSROOM


E. MURIEL J. WRIGHT 
Washington University 


MUCH OF the research relating to the improvement
of classroom teaching has been dependent on
measurements carried out in the pre-lesson
and/or the post-lesson periods. This approach,
valuable as it is, is indirect, and there is need 
for a more direct approach. It seems reasonable 
to suppose that direct study of the lesson by an 
observational technique would avoid certain of the 
variables encountered in the indirect approach and, 
at the same time, should possess intrinsic valid- 
ity. Simultaneous consideration of the subject 
matter taught and the method of its development,
two interdependent facets of the lesson, could
also be achieved.

There follows, hereafter, a description of the 
design of an instrument for direct observation of 
the verbal interaction of teacher and pupils in a 
mathematics classroom. Support for this design 
is sought in a brief empirical study of aspects of 
validity and reliability of such an instrument,
including effective ways of reporting the
observations.


Design of the Instrument 





Categories related to objectives of mathema- 
tics teaching formed the basis of the instrument. 
Trained observers, sampling the verbal interac- 
tion of teacher and pupils, classified behaviors by 
means of the defined categories. This permitted 
a quantitative record of the emphasis of the select- 
ed objectives. 


The Categories 





Fairly general aims in the teaching of secon- 
dary school mathematics were selected to provide 
a basis for judgment of the developing lesson. 
These were sought in the literature dealing with
general educational objectives, in the literature 
concerned particularly with mathematical objec- 
tives, in writings of specific mathematical inter- 
est and from personal teaching experience. 

Categories for the classification of classroom
behavior were selected from these statements of
aims of mathematics teaching in the light of three
criteria. Each aim was capable of careful
definition, important for general and for
mathematical education, and feasible of attainment
in the secondary school. These categories fell
naturally into three frames of reference: ability
to think, appreciation of mathematics, and
attitude of curiosity and initiative.


Frame A. Conscious developing (teacher) or use
(teacher or pupil) of ability to think by:
1. analyzing, 2. synthesizing, 3. specializing,
4. generalizing.

Frame B. Conscious developing (teacher) or
demonstration (teacher or pupil) of appreciation
of mathematics, an evaluative approach to:
1. the methodology of mathematics, 2. the subject
matter of mathematics, 3. the place of
mathematics in other fields and areas, 4. the
place of mathematics in history.

Frame C. Conscious fostering (teacher) or
demonstration (teacher or pupil) of an attitude of
curiosity and initiative: 1. enthusiasm for fresh
knowledge, 2. independence of thought and action.











The requirements of an instrument for system- 
atic observation (exclusiveness, exhaustiveness, 
and unity) had also to be met by this definition of 
categories. Logically and by practical explora- 
tory use, these frames (and within each frame the 
several categories) were deemed to be exclusive. 
Provision of a neutral category for behaviors not 
falling in the defined frames ensured exhaustive 
classification. Finally, because the aims were
defined to prove necessary and sufficient to
encompass the aims of mathematics teaching, unity
was assumed.

Detailed definition of the positive and negative 
aspects of each frame and of the several categor- 
ies of each follows in Table I. 


The Classification of Behaviors in Terms of 
the Categories 








Frames of Reference—These aims of ability
to think, appreciation of mathematics, and an
attitude of curiosity and initiative constituted
three frames of reference. Each verbal behavior
occurring in the classroom was viewed in all three
frames and was classified under one of the
categories of each. This is shown in Table I. The
first particular example of behavior is:


"(The problem: to show that the roots of a
particular equation are real and unequal)
Pupil: 'If the roots of 16x² - 16x - 51 = 0
are real and unequal, then we should find
the discriminant greater than zero.'"


This was classified under Ability to Think as
Analyzing, positive, or A1+. In addition, this
example occurs under Appreciation as Subject
matter, positive, or B2+, and under Attitude as
Independence, positive, or C2+. When not readily
classifiable under at least two of the above
frames, a behavior was recorded as Neutral.

Single Behavior—The behavior observed was
the verbal responses of teacher and pupils plus
any concurrent blackboard development, vocal
inflection or facial expression contributing to
the meaning of this verbal interaction.

The single behavior was obtained by time
sampling: it was the behavior of the first speaker
in a fifteen-second interval. During each minute
of observation two behaviors were classified. Of
the four fifteen-second intervals occurring, the
first and third were used for observation, the
second and fourth for classification.

Arrington (1) indicated strongly that the role
of time sampling in observational techniques
depends on its ability to yield more reliable
information than would be obtainable by other
means. For this instrument two practical
difficulties directed its use. First, it was
found that many naturally complete behaviors, if
artificially divided, might be classified in
parts, losing, however, their contextual meaning.
The close following of the individual phrase
essential in the study of the interaction in the
classroom might result readily in such
sub-classification of complete behaviors. Second,
because judgment and recording in three frames
simultaneously was required, short periods had to
be taken from the continuous interaction to
complete this. Thus, omission of classification of
variable amounts of the natural interaction would
take place. Complete elimination of these
difficulties would occur only if adequate
recording devices, i.e., sound films, were
practical. However, the two difficulties were
diminished by time sampling: first, an interval of
sufficient length to ensure classification of most
behaviors was chosen; second, use of alternate
intervals for classification permitted assessment
of the amount of interaction unobserved.

Unit of Behavior—The single lesson provided
a natural unit for observation of the classroom.
Because of some variation in length of period from
one class to another, it was decided to limit the
observation to forty-five minutes. A total
frequency of ninety behaviors (two per minute) was
thus obtained for each standard unit of observation.
The observer viewed and classified behavior
from a desk at the back of the classroom using
duplicated recording sheets and a stop-watch.
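
The sampling scheme described above (four fifteen-second intervals per minute, the first and third observed, the second and fourth used for classification, over a forty-five minute unit) can be laid out as a simple schedule. The following is a minimal sketch only; the function name and the "observe"/"classify" labels are illustrative assumptions and do not come from the study.

```python
def sampling_schedule(minutes=45, interval_s=15):
    """Return (start_second, activity) pairs for one unit of observation.

    Within each minute the 1st and 3rd intervals are observed and the
    2nd and 4th are reserved for classification.
    """
    schedule = []
    for start in range(0, minutes * 60, interval_s):
        index_in_minute = (start % 60) // interval_s  # 0, 1, 2, 3
        activity = "observe" if index_in_minute % 2 == 0 else "classify"
        schedule.append((start, activity))
    return schedule


schedule = sampling_schedule()
observed = [start for start, activity in schedule if activity == "observe"]
print(len(schedule), len(observed))  # number of intervals, number observed
```

With the default arguments the schedule holds 180 intervals, of which 90 are observation intervals, matching the ninety behaviors per unit reported above.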


The Observers 





The classification of behaviors under the de- 
fined categories was determined by the understand- 
ing, interpretation and inference of the observers. 
Care, therefore, was directed to adequate prepar- 
ation in terms of mathematical understanding and 
of knowledge of the experimental procedures. The 
principal observer, 1, was the investigator hold- 
ing degrees in mathematics and in education and 
with experience in the secondary classroom. The 
assistant observer, 2, was a fourth-year student 
at the University who was planning to become a
teacher of mathematics in the secondary school. 
This student was majoring in mathematics and had 
completed most of her professional courses, but 
had no classroom teaching experience. 

Experience and Training of the Observers—Ob- 
server 1’s understanding of the problem had been 
developing during an eight-month period of exper- 
imental design. Her experience in recording was 
in the use of successively modified schedules and 
amounted to approximately sixty clock hours. 

Observer 2 was trained as follows: General
discussion of the purpose, scope and methods of
the problem was followed by detailed study of the
arbitrary definitions of behaviors as collected in
an Observers' Manual. Thereafter, three visits
occurred for practice in classification of
behaviors in the classroom. Beginning with
five-minute periods of classification in a single
frame only, practice was extended until, at the
third visit, the full 45-minute unit was studied in
all three frames. Further experience was gained
by repeated classification of filmed interaction,
and by observation of fourteen units in the
classroom itself.

Interaction Effects of the Observer and of His
Subjects—Seeking to minimize the effect of the
observer on the subject, Observer 1 met all of the
principals and most of the teachers before the
period of observation began, presenting the
proposal and answering resultant questions.
Detailed description of the observational device
was purposely avoided. This was explained in terms
of the possible behavioral effects and was
accepted readily. On the first day of observation
the observers were introduced to the pupils as
visitors from Washington University interested in
teaching. Probably because of the custom of
students from the University visiting in these
schools, the pupils generally did not show any
particular interest in the observers.

Acknowledgement of the assistance given, and
a brief outline of results obtained, was sent to
each teacher after the analysis was complete.


The Classes 


The twelve algebra classrooms in which the
instrument was studied were in secondary schools
in Metropolitan St. Louis in a setting of higher 
than average economic status. The classrooms 
were selected to be comparable in terms of size 
and organization of school, academic and profes- 
sional qualifications of the teachers, broad admis- 
sion requirements to high school, and heterogen- 
eous grouping as far as mathematical ability was 
concerned. 

Most of the lesson period was spent in group
discussion conducted with the teacher at the
blackboard, the pupils at their seats. There were
occasional periods of supervised study.


Testing of Validity 





At this stage in the development of this instru- 
ment two types of validity—content and construct 
—and reliability were considered. 

Assessment of content validity included study 
of the 15-second division at alternate intervals as 
a valid sample of the natural behaviors of a unit; 
also considered was the number of units needed 
to make adequate representation of the mean be- 
haviors of a topic. 

The study of construct validity was directed to 
assessment of logical relationship between behav- 
iors revealed by this instrument and general 
teacher and pupil characteristics. 

Reliability of the observer was assessed in two 
ways—agreement between observers, and agree- 
ment between repeated classifications of the same 
filmed interaction by the single observer. 


The 15-Second Division at Alternate Intervals 





Comparisons of the chosen 15-second fixed
division, and of a longer and a shorter fixed
division, were made with a natural division of the
interaction. It was essential that this be the
first step in consideration of the design of the
instrument because all observations in the
classroom were to be made by this time sampling.


Method 


Four ten-minute films (4) were selected as
approximating a class period. Typescripts of the
sound tracks of these films were prepared.
Artificial division of the typescripts into
arbitrary time intervals of 7½ seconds, 15 seconds,
and 30 seconds from zero time was made. This was
followed by a division by classifiable behaviors
without regard to time. This latter division was
termed the "natural" division, and a natural
behavior was defined as "a continuous section of
the defined interaction, chosen without regard to
time, forming a behavior which, when considered in
its context, is readily classifiable in A, B, or C,
considered separately, or as N."

Then the behavior in each 7½-second interval
was classified by Observers 1 and 2, individually,
in each of the frames A, B, and C, or
alternatively as N. Classification in each of the
frames or as N was repeated for the 15-second and
the 30-second intervals. It proved necessary to
include an additional category "unclassifiable"
because of the occurrence of a number of
intervals, particularly in the 7½-second division,
with too few words to permit classification.

Next, the typescripts were divided by Observers
1 and 2, individually, by natural behaviors
classifiable in frame A or alternatively as N.
This was repeated for frame B and for frame C, in
case the lengths of the natural intervals in the
three frames did not coincide.

Points of disagreement in a classification for 
each division were resolved by discussion, pro- 
viding a single joint classification by observers 1 
and 2. 


Results 


Totals of teacher and pupils behaviors were
obtained for the four films taken as an entity.
These totals for the 7½-second, the 15-second, the
30-second and the natural divisions were 324, 163,
82, and 175, respectively.

Ratios of total teacher-pupils behaviors for
these same divisions were 2.59, 2.60, 2.77, and
2.02, respectively. In each fixed division, an
enlarged teacher effect resulted.
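
As a rough check on these figures, the number of fixed intervals that four ten-minute films would yield at their nominal length can be set beside the reported totals, and each teacher-pupils ratio can be restated as the teacher's share of all behaviors. The sketch below reads the shortest division as 7½ seconds and treats the films as exactly ten minutes each; both are assumptions, and the names are illustrative.

```python
FILM_SECONDS = 4 * 10 * 60  # four films at a nominal ten minutes each


def expected_intervals(interval_s, total_s=FILM_SECONDS):
    """Number of fixed intervals of the given length in the whole recording."""
    return total_s / interval_s


def teacher_share(ratio):
    """Convert a teacher:pupils ratio into the teacher's fraction of behaviors."""
    return ratio / (1 + ratio)


# Totals and ratios as reported in the text.
reported_totals = {7.5: 324, 15: 163, 30: 82}
reported_ratios = {"7.5 s": 2.59, "15 s": 2.60, "30 s": 2.77, "natural": 2.02}

for interval_s, total in reported_totals.items():
    print(interval_s, expected_intervals(interval_s), total)

for name, ratio in reported_ratios.items():
    print(name, round(teacher_share(ratio), 3))
```

The nominal counts are 320, 160 and 80 intervals, close to the reported 324, 163 and 82. On the same arithmetic the teacher accounts for roughly two thirds of the behaviors in the natural division and somewhat more in each fixed division, which is the enlarged teacher effect noted above.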

Further consideration of the 7½-second or the
30-second divisions was not warranted after noting
wide disparity from the natural values in their
total number of responses, the teacher-pupil ratio,
and the lengths of the intervals.

A frequency distribution of the lengths of the
natural division is illustrated in Figure 1. These
lengths coincided in the classification of
behaviors in each of the three frames.

Using the "t" test of significance of a single
mean (3), probability levels of non-significant
differences between totals of the 15-second
division and of the natural division were obtained
(5) and recorded in Table II. It was noted that in
no frame was a significant difference apparent
(significance level, p = 0.05).
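
A test of a single mean of this kind can be sketched as below. The source does not reproduce the computation, so the layout of the differences (one per category, fixed division minus natural division) and the sample values are hypothetical; only the general form of the statistic is standard.

```python
import math
import statistics


def single_mean_t(differences):
    """t statistic and degrees of freedom for H0: the mean difference is zero."""
    n = len(differences)
    mean = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample SD, n - 1 in the denominator
    t = mean / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical differences in frequency for seven categories (n = 7).
diffs = [2, -1, 0, 3, -2, 1, -1]
t, df = single_mean_t(diffs)

# Two-tailed critical value of t for 6 degrees of freedom at p = 0.05.
CRITICAL_T_DF6 = 2.447
print(round(t, 3), df, abs(t) > CRITICAL_T_DF6)
```

A value of |t| below the tabled critical value corresponds to a difference that is not significant at p = 0.05.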

Probability levels of the non-significant
differences between totals of 15-second even and
odd sets of intervals were recorded in Table III.
It was noted that none of the frames showed
significant differences at the significance level
p = 0.05. Indeed, of the twelve comparisons made,
only two probability levels less than 0.60 occurred.

The use of this "t" test necessitated the
assumption of the continuity of the variable formed
by the differences between the frequencies of
corresponding categories in the fixed and in the
natural divisions. Later this assumption was
avoided by the study of each category separately.


TABLE II

PROBABILITY LEVELS BETWEEN 15-SECOND AND
NATURAL DIVISIONS*

Measure        Teacher    Pupil    Class

                0.10
(A, B)
P(N,U)
(C)
P(N,U)          0.05       1.00     0.15

* Teacher, pupil and class behaviors in Ability to Think, A,
Appreciation, B, and Attitude, C. Symbols for Neutral and
Unclassifiable: N, U. For A and B, n = 7; for C, n = 3; for N, U,
n = 1. Use of "t" test of significance of a single mean.
Significance level p = 0.05. Where p > 0.05, difference is not
significant. Where p < 0.05, difference is significant.


TABLE III

PROBABILITY LEVELS OF SIGNIFICANT DIFFERENCES BETWEEN
THE ODD AND EVEN SETS OF INTERVALS OF THE
15-SECOND DIVISION*

Measure        Teacher    Pupil    Class

PA              0.60       0.85
PB              0.85
PC              0.86
P(N,U)          1.00       1.00

* Teacher, pupil and class behaviors in Ability to Think, A,
Appreciation, B, and Attitude, C. Symbols for Neutral and
Unclassifiable: N, U. For A and B, n = 7; for C, n = 3; for N, U,
n = 1. Use of "t" test of significance of a single mean.
Significance level p = 0.05. Where p > 0.05, difference is not
significant. Where p < 0.05, difference is significant.


Discussion and Inferences 





The 7½-second division and the 30-second
division were inappropriate as approximations to
the natural division because of wide discrepancies 
apparent in total numbers of responses, teacher- 
pupil ratios of behaviors and lengths of intervals. 

The 15-second division and the natural division 
compared closely in terms of total numbers of re- 
sponses and lengths of intervals, while the ratio 
of teacher-pupil behaviors still indicated an en- 
larged teacher effect. However, no significant 
differences were shown to occur between corres- 
ponding teacher, pupils and class behaviors in 
frames A, B, or C, or Neutral and Unclassifiable. 
Further, no significant differences occurred be- 
tween classifications of the even and odd alternate 
sets of 15-second intervals. 

Recognizing the limitations of the use of any 
fixed division in approximating the observation of 
natural behaviors and in particular the enlarged 
teacher effect present in its teacher-pupils ratio, 
the 15-second division at alternate intervals was 
shown empirically to be a valid sample of the to- 
tality of behaviors of a unit. 

Such comparison of the classifications of a 
fixed division and a natural division seems unusual
in studies of validity of time sampling. For im- 
mediate description the criterion has been that of 
reliability. For conclusions on the general situ- 
ation the accepted criterion has been that of the 
internal consistency of the data determined by 
comparison of different samples taken in the same 
manner in the same situation. It is noted that this
assumes that variation other than that resulting 
from randomization does not occur between suc- 
cessive samples. In the classroom, while a gen- 
eral pattern typical of a teacher and his pupils 
may well emerge, some variation between succes- 
sive lessons will be intentional or may result from 
such other factors as change in content. 


Number of Units Necessary to Represent 
Adequately the Mean Behavior 
of a Topic 








Observation of classroom interaction using the 
15-second sampling interval was then begun. It 
was necessary to determine empirically the min- 
imum number of units of behavior which would 
provide representative estimates of mean behav- 
iors of a topic. 





Method 


Three classrooms studying the same topic in 
‘‘Algebra 3’’ were visited by Observer 1. Classi- 
fication of all units of behavior in that topic was 
carried out. The results were recorded as Class- 
rooms I, II, and III. 

Totals of behaviors of each teacher and group of
pupils were calculated for each category. For 
convenient comparison, these raw score frequen- 
cies were then adjusted proportionately in order 
that a standard total number of 100 responses was 
obtained for each class period. 
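
The proportional adjustment to a standard total of 100 responses can be sketched as follows. The category labels and raw counts are hypothetical; only the rescaling rule is taken from the text.

```python
def adjust_to_standard_total(raw_counts, standard_total=100):
    """Rescale raw category frequencies so that they sum to standard_total."""
    total = sum(raw_counts.values())
    if total == 0:
        raise ValueError("no behaviors recorded for this class period")
    return {category: count * standard_total / total
            for category, count in raw_counts.items()}


# Hypothetical raw frequencies for one class period (90 behaviors).
raw = {"A1+": 18, "A2+": 9, "B2+": 27, "C2+": 6, "N": 30}
adjusted = adjust_to_standard_total(raw)
print(adjusted)
```

Rescaling leaves the relative emphasis of the categories unchanged while making class periods with different raw totals directly comparable.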

Cumulative arithmetic averages, xj, of adjusted
frequencies of behaviors were calculated beginning
with unit 2. Unit 1 was not considered in these
because it was felt likely that the first lesson
of a topic might vary markedly from the lessons of
the succeeding days. The means for the complete
topic, including unit 1, were defined as the
theoretically best estimates for each category
for the topic: Xj.
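
For a single category, the cumulative averages beginning with unit 2 and the topic mean including unit 1 can be computed as in this sketch. The scores are hypothetical adjusted frequencies, and the function names are illustrative.

```python
def cumulative_means_from_unit_2(unit_scores):
    """Means over units 2..j for j = 2, 3, ...; unit_scores[0] is unit 1."""
    means = []
    running_total = 0.0
    for count, score in enumerate(unit_scores[1:], start=1):
        running_total += score
        means.append(running_total / count)
    return means


def topic_mean(unit_scores):
    """Mean over every unit of the topic, unit 1 included."""
    return sum(unit_scores) / len(unit_scores)


# Hypothetical adjusted frequencies of one category over five units.
scores = [12.0, 20.0, 24.0, 16.0, 18.0]
print(cumulative_means_from_unit_2(scores))
print(topic_mean(scores))
```

Because the later cumulative averages share most of their units with the topic mean, the two quantities become increasingly dependent as more units are included.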

Frequencies of individual categories were test- 
ed for the significance of their differences from 
the theoretical frequencies by considering the 
probability of occurrence of such differences in 
the binomial distribution. 
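
One way such a binomial comparison might be carried out is sketched below. The source does not give the exact calculation, so the use of the standard total of 100 responses as the number of trials, the two-sided form of the probability, and the sample figures are all assumptions.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k occurrences in n trials with proportion p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_deviation_at_least(k_obs, n, p):
    """Probability of a deviation from n*p at least as large as that observed."""
    expected = n * p
    observed_dev = abs(k_obs - expected)
    return sum(binomial_pmf(k, n, p)
               for k in range(n + 1)
               if abs(k - expected) >= observed_dev - 1e-12)


# Hypothetical category: theoretical frequency 20 per 100 responses.
n, p = 100, 0.20
for observed in (20, 22, 30):
    print(observed, round(prob_deviation_at_least(observed, n, p), 3))
```

A probability near 1 indicates a cumulative average close to the theoretical value; a small probability indicates a difference unlikely to arise by chance alone.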

Present in the use of this analysis was the
assumption of independence of xj and Xj. This
assumption would prove justifiable in practice if
the theoretical value were the mean of a very large
number of units of behavior. The errors entailed
by the assumption in this problem became quite
marked as the number of units of behavior included
in the cumulative averages increased towards
the total number of units observed. It was felt,
however, that consideration of individual
categories by the binomial distribution included
fewer assumptions than would the "t" tests as used
above, or the χ² test, which appears in numerous
investigations throughout the literature. In
addition to the assumption present in this use of
the binomial distribution, the use of χ² in the
present stage of analysis would include an
assumption of independence of the several
categories. While their exclusiveness has been
argued logically and imposed by the method of
classification, it was felt that empirical
justification of an assumption of independence
should precede use of χ².


Results 


Probability levels of non-significant differences 
between successive cumulative averages and the 
theoretical averages were obtained for each cate- 
gory. A general gradual increase in the levels of 
probability occurred as the number of units includ- 
ed in the cumulative averages increased. In cer- 
tain instances, this rise was discontinuous. In no 
category nor in any classroom were consistent 
sharp increases in the levels of probabilities
observed.

Investigation followed of the differences
between the four-day cumulative average (largest
obtainable from all three classes) and the
theoretical average in all three classes. For this
average of four units the number of categories not
significantly different from the theoretical values
was noted for probability levels p = 0.70,
p = 0.80, and p = 0.90. These were approximately
4/5, 3/5 and 2/5, respectively.

Limited assessment followed of the consistency 
of the results of a single day with the mean scores.
At a probability level of p = 0.05, less than 1/10 
of the scores of individual categories for the 
single unit differed significantly from the mean 
category score. 


Discussion and Inferences 





In seeking a practical limit to the number of
days necessary for observation of a classroom, it
was decided that the totality of behaviors in
a topic would be assumed to represent all
behaviors occurring in the classroom. Cumulative
averages of the units of a topic were then
considered, in an attempt to determine a small
number of units that could be used to approximate
the topic. It was considered that if such a small
number of units existed within a topic, a sharp
rise in the probability of non-significant
differences should occur throughout all the
categories and in each experiment on a particular
day. Such a sharp rise did not occur. The steady
though occasionally discontinuous rise which was
met could well be accounted for by the increasing
degree of dependency between cumulative averages
and theoretical values as numbers of units
increased. It was felt, therefore, that the small
number of units providing the theoretical values
of the categories prevented an answer to this
problem. A profitable study might well result from
consideration of the same problem using
observations collected over a very large number of
units, eliminating the assumption concerning the
topic behaviors equalling the totality of all
classroom behaviors, and minimizing the dependence
of cumulative averages on the mean of all units.

It was necessary, practically, for the remaining
observations in this series to limit the number
of units of observation in each classroom. It was
decided to visit classrooms during four units of
behavior and, presuming that the first and last
days of a topic might well be atypical, to collect
data only on intervening days of the topic.


The Level of Agreement of Observers in
Applying the Arbitrary Definitions of
Classification of Behaviors











As a prerequisite to validity, it was also
necessary to assess the arbitrary definitions of
classification of behaviors in the light of
observer agreement.

The test of observer reliability was twofold:
first, the level of agreement in classification
achieved between observers in single units of
classroom behaviors, and second, the level of
agreement in classification reached between
repeated observations by the same observer of a
single unit of behavior.


Method 


Two observers took part in the complete set of 
observations. Fourteen class periods were clas- 
sified by both Observers 1 and 2. In addition each 
observer made repeated classification of a single 
recorded unit of behavior at three weekly intervals 
throughout the experimental period. 


Results 


Level of Agreement in Classification Between
Observers—Scattergrams of the corresponding
scores of Observers 1 and 2 in all categories of
teacher, pupil and class behaviors for units of
observation 1, 7 and 14 are shown in Figure 2. By
day 14, the results indicated a marked rise in
observer agreement over those for units 1 and 7,
and a high absolute achievement of agreement in
the close clustering of points about the line of
perfect agreement.

The probability levels of individual category
frequencies of behaviors classified by Observer 2
being different from those as classified by
Observer 1 were recorded. A summary of the number
of differences non-significant at the probability
level p = 0.70 was illustrated graphically on a
linear scale in Figure 3.

After the first four observations a marked trend 
towards increasing agreement was noted. 

Level of Agreement in Classification Between
Repeated Observations of a Single Unit by the Same
Observer—Observed frequencies and deviations of
these from "true" scores in each category were
recorded for Observer 1 and for Observer 2.
Illustration was made of deviation from true
values in Figures 4 and 5 for Observers 1 and 2
respectively. The horizontal axis represented zero
deviation from the theoretical frequency.

These figures were used to compare observed 
with theoretical frequencies; successive observa- 
tions by the same observer with each other; teach- 
er, pupils and class patterns; and finally, the 
classifications of the two observers. 











Discussion and Inferences 





Realization of the lack of a well-developed
methodology of observational techniques cited by
Heyns and Lippitt (7) continued to grow in seeking
to assess observer agreement.

Use of means of observer scores in statistical
assessment of between-observer agreement was
recommended by such investigators as Bales (2) 
and Steinzor (8) without indication of the depend- 
ence of this mean on the individual observer 
scores compared with it. In addition, the level of 
experience of the two observers was not compar- 
able, preventing use of the technique suggested by 
Guetzkow (6). Differences between the scores of 
Observers 1 and 2 were consequently studied by 
considering the probability of Observer 2’s scores 
being different from theoretical (Observer 1’s) 
scores, by means of the binomial distribution. 

In the area of level of agreement between
repeated observations by the same observer of a
single unit of behavior, the probability of
differences occurring between observed and
theoretical results also was studied by means of
the binomial distribution. However, the nature of
the scores themselves induced a further
difficulty. Where frequencies in individual
categories were low, and in particular where the
theoretical frequency of a category equalled zero,
the resultant 0.00 probability presented a
markedly distorted picture. Therefore, it was
decided to indicate deviation from the theoretical
results at this stage by absolute values, and to
illustrate this on a linear scale in Figures 4
and 5.
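
The difficulty with a zero theoretical frequency can be seen directly: when the theoretical proportion is zero, the binomial probability of any non-zero observed frequency is exactly zero, however small the discrepancy. The sketch below contrasts this with the absolute deviation. The category labels and frequencies are hypothetical, and the total of 100 responses is an assumption.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k occurrences in n trials with proportion p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N_RESPONSES = 100  # assumed standard total per unit

# Hypothetical theoretical ("true") and observed frequencies.
theoretical = {"A1+": 12, "B4+": 0, "C1-": 1}
observed = {"A1+": 11, "B4+": 1, "C1-": 0}

deviations = {}
for category, true_freq in theoretical.items():
    p = true_freq / N_RESPONSES
    probability = binomial_pmf(observed[category], N_RESPONSES, p)
    deviations[category] = abs(observed[category] - true_freq)
    print(category, round(probability, 3), deviations[category])
```

All three categories here differ from their theoretical values by a single behavior, yet the probability for the category with zero theoretical frequency is 0.00, which is why absolute deviations give the less distorted picture.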

Level of Agreement in Classification Between
Observers—Agreement between the two observers
increased rapidly during the 14 units of
observation until 50 of the 60 categories were not
significantly different at a probability level
p = 0.70. Unfortunately, no evidence of use of the
binomial expansion in a similar type of test was
found in the literature to enable direct
comparison with other studies. However, the
agreement reached seemed to be approaching a
favorable level.

Because only two observers were compared,
no conclusion may be drawn beyond the immediate
situation. However, the results would imply
success of the training methods for Observer 2.
In addition, the level of agreement achieved
allowed inference of the validity of the
classification of the behaviors observed in this
study.

Level of Agreement in Classification Reached
Between Repeated Observations by the Same
Observer of a Single Unit of Behavior—Because of
overall small deviation from theoretical values,
and because of consistency in this deviation
between successive trials, it may be inferred that
the classification by Observer 1 throughout the
observational period (which began after the 2nd
trial recorded here) was fairly consistent. It may
also be inferred from the gradually decreasing
values of the deviations in successive trials that
some increased skill with additional experience
was gained. Similar considerations follow from the
two trials of Observer 2.

Two categories showed marked deviation from
the theoretical values, despite consistent scores
in successive trials and between observers: the
positive and negative aspects of curiosity. This
effect had not been unexpected. Discussion of
interpretation of behaviors in this frame of
attitude had brought out the importance of
inflection of voice, and the effect of overall
lesson development. It had been considered that
these two factors might well be hidden or
minimized in the close examination of a typescript
as used in this part of the study. This emphasized
the importance of direct classroom observation.


Magnitudes and Variability of Mean Behaviors 
of Teachers and Pupils 








These considerations of the validity of the
sampling interval, the minimum number of units to
be observed, and the level of agreement of the
observers necessarily preceded interpretation of
results.

Three main questions regarding classroom be- 
havior observed were raised. First, were the 
mean behaviors in Algebra 1 classes significantly 
different from those of Algebra 3? Second, how 
should the magnitudes and variability of the se- 
lected aims be combined in describing the class- 
room? Finally, were known characteristics of 
teacher or pupils logically related to significant 
differences of behavior revealed by the instru- 
ment? 


Method 


Materials Used—Interaction in six classrooms
of each of Algebra 1 and Algebra 3 was classified
by Observer 1 for four units of behavior.

The ‘‘t’’ test of significance of the difference 
between means was applied to corresponding be- 
haviors in Algebra 1 and 3 (p = 0.05) (ref. 3). 
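
A test of the difference between two means of this kind, with six classrooms in each course, might be sketched as follows. Whether the original analysis pooled the two variances is not stated, so the pooled form used here is an assumption, and the classroom means are hypothetical.

```python
import math
import statistics


def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_error
    return t, n_a + n_b - 2


# Hypothetical mean frequencies of one category in six classrooms per course.
algebra_1 = [10, 12, 9, 11, 13, 11]
algebra_3 = [12, 14, 11, 13, 15, 13]
t, df = pooled_t(algebra_1, algebra_3)

# Two-tailed critical value of t for 10 degrees of freedom at p = 0.05.
CRITICAL_T_DF10 = 2.228
print(round(t, 3), df, abs(t) > CRITICAL_T_DF10)
```

With six classrooms in each group the test has ten degrees of freedom, so a difference is taken as significant at p = 0.05 only when |t| exceeds 2.228.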

Scores of individual categories of teacher and
pupils were illustrated in two ways: in absolute
value on a bar graph, and in relation to mean
values, by a profile comparison. In preparing the
profiles, it was noted that scores of negative
categories were inherently negative. Thus the mean
- 1 p.e. was greater in absolute value than the
mean. It was also felt that while a certain number
of neutral behaviors was probably necessary for
the conduct of a lesson, an incidence much higher
than the mean value would suggest poor class
management. Therefore, as in the negative
categories, the scores of the neutral category
were considered negative. However, since it seemed
likely that some neutral behaviors were necessary,
caution was exercised in interpreting values
falling far to the right as indicating high
performance.

The scores of several of the twelve classes
were then examined for differences in the observed
behaviors that might logically be expected from
certain known characteristics. Class XI serves as
an example. In Class XI, current extra-curricular
responsibilities had limited the preparation
time of the teacher. It was felt that this might be
evidenced in class by an increased number of
teacher behaviors. In particular, actual planning
of the problems to be used in the lesson period
itself would entail a greater need for teacher
demonstration of analyzing and/or synthesizing
rather than encouragement of this in the pupils. It
was also presumed that encouragement of initiative
would be low to avoid revelation of inadequate
preparation. The scores in pupil initiative might
be expected to indicate whether this lack of
teacher preparation was a common experience:
relatively high C2− would be likely to endorse this
point of view. In addition, it was expected that
pupil interaction would tend to contribute to the
solution rather than the understanding of the
problem; thus an emphasis in pupil behavior of
analyzing and/or synthesizing and a decrease in
specializing. Therefore, comparison of the results
of Class XI with means of Algebra 1 was made in the
following areas:


1. Totals of teacher and pupil behaviors

2. Teacher behaviors in A2+, A3+, C2+, and C2−

3. Pupil behaviors in A2+, A3+, C2+, and C2−


Results 


Means and standard deviations of raw scores
forming representative behaviors were obtained
for each classroom. From these, means and standard
deviations for the Algebra 1 and the Algebra 3
classrooms were calculated. Using the ‘‘t’’ test
of significance of difference between means, and
a probability level of 0.05, only two of the
twenty-one categories of teacher behaviors were
found to be different. These were positive
analyzing and positive synthesizing. In each case
the mean frequency of the teachers in Algebra 1 was
less than the mean frequency of the teachers in
Algebra 3. One category of pupil behavior was
different. This was negative curiosity, in which
the Algebra 1 mean was greater than the Algebra 3
mean.

A bar graph illustrating the mean behavior of 
teachers and pupils in Algebra 1 is shown in Fig- 
ure 6. Most of the behaviors in each were relat- 
ed to the three frames of reference, only thirteen 
in one hundred proving neutral in these areas. 
Consideration of the number of behaviors record- 
ed for teacher and pupils separately emphasized 
the major part the teacher played in class inter- 
action. Despite one teacher and approximately 
25 pupils making up a class, five teacher behav- 
iors occurred to every two pupil behaviors. 

Using the profile of Class XI for comparison
with mean values, the following differences were
assessed: (1) Totals of teacher and pupil
behaviors of Class XI were tested for their
difference from mean behaviors of six classrooms
using χ². Corrected for continuity, χ² = 1.18. This
was less than the value of χ² at the chosen
significance level, viz., 3.84, so goodness of fit
was established. Thus the hypothesis that the
number of teacher behaviors would be significantly
less than mean teacher behaviors was not supported.
(2) From the profile comparison, Figure 7, it was
noted that teacher frequency in positive
synthesizing, A2+, was high, teacher frequency in
positive specializing, A3+, was low, teacher
frequency in positive independence, C2+, was low,
and in negative independence, C2−, was average. The
first three relationships with the mean were thus
shown to agree with the earlier hypothesis. (3)
From the profile comparison, Figure 7, it was noted
that pupil frequency in positive synthesizing, A2+,
was high, pupil frequency in positive specializing,
A3+, was low, pupil frequency in positive
independence, C2+, was low, and in negative
independence, C2−, was low. The first three
relationships agree with the hypothesis. The point
of view that the type of lessons observed occurred
regularly was not supported.
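The goodness-of-fit test in item (1) can be sketched as follows. The function name `yates_chi_square` and the observed counts are hypothetical; the study's own counts are not given in the text. Only the 3.84 critical value and the 5:2 teacher-to-pupil ratio come from the report.

```python
def yates_chi_square(observed, expected):
    """Chi-square goodness of fit with the continuity correction (|O - E| - 0.5)."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

CHI2_CRITICAL_05_DF1 = 3.84  # value quoted in the text for the chosen level

# Hypothetical counts of [teacher, pupil] behaviors in one class.
observed = [60, 30]
total = sum(observed)
# Expected split uses the 5:2 teacher-to-pupil ratio reported for the mean classroom.
expected = [total * 5 / 7, total * 2 / 7]

chi2 = yates_chi_square(observed, expected)
fits_mean_pattern = chi2 < CHI2_CRITICAL_05_DF1
```

A value below the critical figure, as here, means the class totals do not depart significantly from the mean pattern.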


Discussion and Inferences 





Generally, mean behaviors of teacher, pupils
or class in Algebra 1 were not different from those
shown in Algebra 3. It may be inferred that the
two samples studied were drawn from the same
population. Thus in this instance, differences in
specific subject matter or in age of pupils did not
affect significantly the pattern of behaviors. This
was not surprising in terms of subject matter, for
the courses represented two parts of the overall
algebra program for the state, with differences in
content rather than in difficulty of subject matter.
It was possible, however, that the differing ages
of the pupils in the two samples, and the probable
selection of pupils preceding the course in Algebra
3, might have resulted in general significant
differences in pattern. Later study, then, might be
carried out within the secondary school algebra
program with less regard for the formal divisions
occurring in the schools.

The order of emphasis of categories in the aim
ability to think was as follows: specializing,
synthesizing and generalizing (approximately
equal), analyzing. It was felt that the order of
the first three was reasonable, but that the small
incidence of behaviors in analyzing was undesirable
in the light of the widespread usefulness of this
mode of thought. The order of emphasis in the aim
of appreciation was as follows: methodology and
subject matter (approximately equal), other fields
and areas, historical development (no behaviors).
It was felt that emphasis of the first two was most
desirable in mastery of a body of knowledge, but
that greater emphasis in the latter two would
contribute to the pupils’ enthusiasm and their
understanding.

The order of emphasis in attitude was negative
and positive independence (approximately equal),
and curiosity (negligible). The low incidence of
curiosity was disturbing in considering the
potential force for learning left unexploited.
Greater emphasis of positive independence was also
felt desirable, although, as the negative aspect
included an absence of encouragement as well as
active discouragement, uncertainty as to the level
to be attained was expressed.

Examination of the profile comparison in light 
of known characteristics and expected differences 
supported the validity of the instrument, for gen- 
erally, expected differences did occur. 

Development of these two aspects of examina- 
tion of scores—in absolute value and in profile 
comparison—would seem promising in seeking a 
means of evaluating the performance with this 
instrument. 


Summary 


The design of an instrument for direct study of
the secondary school mathematics classroom is
described. This design required the selection of
criteria for judgment in secondary school
mathematics; the development of an observational
schedule based on these criteria; and the
empirical assessment of the instrument in the
classroom.

As criteria for judgment, careful definition
was made of certain aims of mathematics teaching.
These aims were defined in three frames of
reference, each of several categories:


A—Ability to think, analyzing, synthesizing, 
specializing, generalizing; 

B—Appreciation of mathematics, methodology, 
subject matter, other fields and areas,
historical significance;

C—Attitude of curiosity and initiative, enthusi- 
asm for fresh knowledge, and independence. 











Single behaviors in the classroom were
classified as positive or negative achievement of
teacher or pupils in all three of these frames of
reference, or as neutral. The single act was the
predominant behavior of the first speaker in a
fifteen-second sampling interval. Alternate
intervals were observed, recording being carried
out in the intervening intervals. The period of
observation was forty-five minutes in length,
resulting in the classification of ninety
behaviors. Each classroom was visited for four such
periods, the mean behaviors resulting in
‘‘representative’’ behaviors for that classroom.
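The sampling arithmetic in this paragraph can be checked with a short sketch. The function names and the tallies for the four visits are hypothetical; only the 45-minute period, the 15-second interval, the alternate-interval rule, and the four visits come from the text.

```python
PERIOD_MINUTES = 45
INTERVAL_SECONDS = 15

def behaviors_per_period(period_minutes, interval_seconds):
    """Count classified behaviors when every other interval is observed."""
    total_intervals = period_minutes * 60 // interval_seconds
    return total_intervals // 2  # the alternate intervals are used for recording

def representative_scores(period_counts):
    """Average each category over the observed periods of one classroom."""
    n = len(period_counts)
    categories = period_counts[0].keys()
    return {c: sum(p[c] for p in period_counts) / n for c in categories}

per_period = behaviors_per_period(PERIOD_MINUTES, INTERVAL_SECONDS)

# Hypothetical tallies for two categories over four visits.
visits = [{"A2+": 10, "neutral": 12}, {"A2+": 8, "neutral": 14},
          {"A2+": 12, "neutral": 10}, {"A2+": 10, "neutral": 12}]
means = representative_scores(visits)
```

Forty-five minutes hold 180 fifteen-second intervals, and observing every other one gives the ninety behaviors reported.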

The observers were the investigator and a
fourth-year student, a potential teacher of
mathematics. It was essential that the observer be
skilled in subject matter as well as in the
techniques of the observation schedule.

Empirical assessment in algebra classrooms
followed this period of design of the instrument.
Analysis of the results permitted the following
inferences:

By an unusual direct comparison of behaviors 
yielded by time sampling and by a defined natural 
division of the verbal interaction, the choice of 
the 15-second interval was supported. In addition, 
a limited consideration of the consistency from 
day to day of observations by this sampling inter- 
val showed a reasonable level of agreement. Fi- 
nally, the facility of its use in the classroom was 
so marked, that the validity of this sampling in- 
terval was felt to be established. 

The second sampling procedure used was that
of observing each classroom for a limited number
of days. No conclusive answer to this problem
was obtained. Practically, it was decided that for
the rest of the present study the sample should
consist of four days of observation of each
classroom.

A high level of consistency in interpretation of
behaviors was shown by the principal observer,
and to a lesser degree by the assistant observer,
who was still gaining observational experience.
These findings supported the classification of
behaviors determined.

The use of bar graphs of absolute values of
teacher and pupil behaviors, and profile
comparisons of individual with mean values,
permitted detailed description of actual emphasis
of aims in the classroom and comparison of
individual classrooms with mean values.
Consideration of certain known characteristics of
teachers or pupils which were in agreement with
differences shown by the instrument supported the
validity of the descriptions yielded. Such profile
comparison may well prove the basis for development
of the instrument for evaluative purposes.


General Appraisal of the Instrument 





Because the basis of judgment for this instru- 
ment is that of aims of mathematics teaching, the 
instrument is theoretically limited to use in math- 
ematics classrooms. However, the surprisingly 
general nature of the statement of aims might well 
bear consideration for application by specialists 
in other subject matter fields to their own situa- 
tion. If such application was successful, general 
systematic comparison of teacher or pupil behav- 
iors, independent of subject matter yet including 
in the observer interpretation the quality of the 
subject matter field, could be attained. 

It is noted that the observers must be exper- 
ienced in the subject matter field as well as trained 
in the particular observational procedures. 

The behaviors classified with this instrument
are those of verbal interaction or its attendant
demonstration. Thus the description of the lesson
period yielded by the instrument is limited to the
amount of verbal class interaction occurring. In
classrooms where long periods of supervised
study are regularly used, this limitation may be
so great that a picture of the in-lesson period by
this instrument would be impossible.

The recording of class interaction in the
present study has been divided into teacher scores
and pupil scores. The pupils’ scores are recorded
as a group; thus no indication is available of
individual pupil performance. In addition,
discussion between teacher and individual pupils
outside of the class interaction period (for
instance, while some of the class is studying) has
not been considered. Thus a possible area of great
teacher and pupil variation is untapped by the
present observational procedure.

Currently another limitation exists in the need 
for preparation of the observer by the investiga- 
tor. This will be necessary until further study of 
the training procedures can be carried out. 
Purpose of this Study


THE BASIC objective of this study was the
development of vocabulary tests that adequately
measure knowledge of multi-meaning words by
pupils in the fourth through eighth grades. It was
believed that this objective could be reached
most effectively by:


1. Constructing vocabulary tests involving se- 
lections of meaning in relation to a pupil’s 
interpretation of a word symbol. 

2. Measuring more meanings per word than
previous vocabulary tests. To illustrate
the single meaning test, an item taken from
the Metropolitan Tests, Elementary Reading
Test, Form R, is shown below:

9. sack   seat   hold   bag   lift   box

The Multi-Meaning Vocabulary Tests, in the item
for the word ‘‘sack,’’ include several meanings
of the word. To illustrate:


(x) a large bag
( ) to hold
(x) a loose coat
(x) to discharge
( ) some ashes
(x) to plunder
(x) a white wine
( ) to hunt


3. Measuring more meanings in approximately 
the same time limits. Individuals were able 
to designate all the meanings of a word in 
about the same time limits as they would 
need for finding the single meaning response. 

4. Determining grade levels of meaning for
each multi-meaning word used in the test
for grades four through eight. No
claim is made that this is a classification
of multi-meaning words according to grade
levels, but pupils’ performance on each
item demonstrated their proficiency or
failure in identifying the meaning at each grade
level included in the test.


Development of the Test 


Method of Selecting Words for the Test 





The Teacher’s Word Book of 30,000 words was
used as a source for securing a basic vocabulary
list. The words in this book are not classified
according to the grade for which they are intended,
but are arranged in alphabetical order and coded
to frequency of occurrence; therefore, each of the
30,000 words was examined, and all the words
beginning with the first grade and ending with the
eighth grade were reclassified according to grade
levels. Figure 1 shows the number of words obtained
for each grade and the total number of words.

It will be noted that some lists were combined,
having one list for two grades, while other grades
had their own lists. This grouping was used in the
Teacher’s Word Book of 30,000 words.











Since it was desired that the final selection of
words for the test be the best possible sampling of
multi-meaning words, criteria were established and
several discards were made from the large initial
list of 8008 words.

The first discarding was: 


1. All prepositions, conjunctions, adverbs,
personal pronouns, past participles, plurals and
proper names were eliminated. The number of
words retained after this first discard is shown in
Figure 2. The first discard reduced the initial
list to 6504.


The next discarding was: 


2. To discard all single meaning words. Each 
of the 6504 words was looked up in the Second Un- 
abridged Edition of Webster’s New International 
Dictionary, and if only one meaning was listed for 
a word, it was eliminated from the list. 

Inspection of Figure 3 shows that the list was
reduced by 612 words, to 5892, by this discard.

Since it was considered basic to the purpose of
the test to measure at least three meanings of a
word, the requirement was set up that a word was
to have more than five dictionary meanings to
remain on the list. This was reasonable because
the majority of words have connotations and
nuances of meaning that are not concepts and could
not be accepted.

FIGURE 1
NUMBER OF WORDS LISTED FOR EACH GRADE FROM THE
THORNDIKE BASIC VOCABULARY LIST FOR GRADES 1-8

Total Number of Words
8,008

FIGURE 3
NUMBER OF WORDS LISTED FOR EACH GRADE AFTER
THE SINGLE MEANING WORDS WERE ELIMINATED

Total Number of Words
5,892

FIGURE 4
NUMBER OF WORDS LISTED FOR EACH GRADE AFTER
FIVE OR LESS MEANINGS WERE ELIMINATED

Total Number of Words
2,640


The third discarding was: 


3. All words having five meanings or fewer were
discarded. The word lists for every grade were
reduced considerably, and little difference
remained among grades in the total number of words.

Each word now on the list was looked up in the 
Thorndike-Lorge Semantic Count of English Words. 
This study appears only in mimeographed form 
and gives the relative frequency of occurrence of 
each meaning per mille. 

A premise was set up that a meaning to be used
in the test had to be at the .050 level. This means
that of 1000 occurrences of the word, a meaning
was used at least 50 times. Many meanings were
eliminated, but no word was discarded through this
screening, because a multi-meaning word would
have at least one common meaning above the .050
level to be classified in Thorndike’s Basic
Vocabulary List.


The fourth discarding was: 





4. Words like do, make, like, come, et cetera, 
having one over-worked meaning, and the others 
used infrequently, were discarded. Nearly 50 per- 
cent of the words were discarded. 

There were still too many words to use in a
preliminary test, because each key word selected
would test an average of five meanings, which
would mean, if no further discarding was done,
that over 7000 meanings would be tested. It would
be almost impossible to secure time enough for
children in a public school to do such a lengthy
project.


The last discarding was: 


5. To tabulate the frequency of occurrence of 
each meaning of the 1572 multi-meaning words. 
A frequency average percent of meaning was found 
for each multi-meaning word. 

The final list consisted of 321 multi-meaning
words, and 1984 meanings were measured. Grades
1 and 2 had the highest number of multi-meaning
words after they were averaged for highest
frequency of occurrence of meaning. Many simple
words have high levels of meaning; therefore, the
final selection was justifiable.
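The successive discards can be tallied in a few lines as a check on the reported totals. The stage labels and the function name `reductions` are paraphrases chosen for illustration; the counts are those given in the text and figure captions.

```python
# Word counts reported after each screening step (initial list first).
STAGES = [
    ("initial list, grades 1-8", 8008),
    ("first discard (parts of speech, plurals, proper names)", 6504),
    ("single-meaning words removed", 5892),
    ("words with five or fewer meanings removed", 2640),
    ("words with one over-worked meaning removed", 1572),
    ("final list", 321),
]

def reductions(stages):
    """Return (label, removed, percent_removed) for each step after the first."""
    rows = []
    for (_, before), (label, after) in zip(stages, stages[1:]):
        removed = before - after
        rows.append((label, removed, round(100 * removed / before, 1)))
    return rows

for label, removed, pct in reductions(STAGES):
    print(f"{label}: {removed} removed ({pct}%)")

# At about five meanings per key word, 1572 words imply this many meanings.
meanings_if_no_further_discard = 1572 * 5
```

The last figure shows why a further discard was needed before a preliminary test could be given in school time.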


Design of Test 


Since these tests were developed on the
principle of measuring multi-meaning words, the
first essential was to find out if pupils
recognized words that had more than one meaning.





Sub-Test I was labeled a Multi-Meaning Word
Recognition Test. A sample item is shown below:

lettuce   moonlight   paddle   squash   port

The pupil underlined the words which have
more than one meaning. The preliminary test
consisted of four units, 50 items to a unit with
five words in each item, and a total of 803
multi-meaning words were presented to each pupil
for recognition. An item analysis was done on this
test, and the final tests consisted of three forms,
40 items to a form. A total of 484 multi-meaning
words was tested.

These vocabulary tests, except for the
recognition test, purport to measure more meanings
per word than other vocabulary tests; therefore:

Sub-Test II was a Multi-Meaning Identification
Test. A multi-meaning word was listed along
with its various common meanings according to
highest frequency of occurrence of meaning. In
the preliminary test each multi-meaning word was
listed with ten alternatives, such as these:


( ) machine for weighing
( ) to climb up
( ) to make a drawing
( ) according to a graduated table
( ) series of tones
( ) a yardstick
( ) a system of numbering
( ) a barometer


A total of 1984 meanings was measured in this
preliminary sub-test.

After the refining was done on each item, the
final test was organized into three forms, with 50
items and seven choices for each key word. The
total number of meanings tested in the three forms
was 690.

A factor of importance in vocabulary testing
seems to be the part a word plays in a true
contextual reading situation versus a definition
knowledge. Multi-meaning words were put in
contextual paragraphs. Ninety-six paragraphs were
written for the preliminary test, and 294 meanings
were tested, for instance:


The stagecoach stopped before the old ( ) lodge.
The passengers were glad to see the lights
burning as they entered to ( ) lodge for the night.
One of the men said that he would ( ) lodge a
complaint because he was being delayed on his
journey.


1. to find fault  2. airport  3. rather angry
4. to take rest  5. to be weary  6. an inn
7. not to be disturbed


Sixty-one paragraphs were obtained for the
three forms of the final test after the item
analysis was done, and 183 meanings were measured.

FIGURE 5
NUMBER OF WORDS LISTED FOR EACH GRADE AFTER WORDS HAVING ONLY
ONE MEANING OF HIGH FREQUENCY OF OCCURRENCE WERE ELIMINATED

Total Number of Words
1,572

FIGURE 6
NUMBER OF WORDS LISTED FOR EACH GRADE
AFTER THE FINAL DISCARDING WAS DONE

Total Number of Words
321

128 JOURNAL OF EXPERIMENTAL EDUCATION


Trial Administration 


Following completion of preliminary editing,
trial administrations of the items were conducted.
The purpose of these try-outs was two-fold. The
first objective was to determine the difficulty of
each item for the grades for which the tests were
intended. This information would be useful in
arriving at a final list of words constituting a
test with a difficulty level appropriate for each
individual taking these tests. The second purpose
was to identify the ineffective items, as such
items would have no value and would not be
included in the final test.

In order to determine the difficulty of each 
item and to identify the ineffective items, trial 
administration of the tests was given to a sample 
population of 491 pupils in grades 4, 5, 6, 7 and 
8 in Dover and Nashua, New Hampshire. 

The pages in each test were arranged in a
different order for each administration so that the
first items would not always be first, and the last
items would not always be last.

The tests were given in twenty-two different 
sittings of fifteen minutes each for eleven consec- 
utive days, one test in the morning session and 
one in the afternoon session. 

Keys were prepared for each test. All of the
tests were hand-scored by persons who had had
much experience in correcting tests.


Item Analysis 


Data showing the distribution of the items ac- 
cording to difficulty level were tabulated. 

A frequency count was made on each of the
items in the Recognition, Identification, and
Context Tests. This was done on the results of the
preliminary tests for grades 5, 6 and 7, because
it seemed that the average would be more consistent
in the middle grades for which the test was
designed. Items were chosen for the final forms
nearest the 50 percent difficulty level and 50
percent ease level, because by this criterion no
item would be too easy or too hard for the group
tested.

The Walker-Cohen test of significance was
used to determine the discriminating power of
items. Each word was analyzed, and if the word
under consideration had good discriminating
power, a much higher percent of pupils with high
total scores chose the correct response than the
percent of pupils in the lowest total score group
choosing the correct response.
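The two criteria just described, difficulty near 50 percent and a higher success rate in the high-scoring group, can be sketched as below. This is not the Walker-Cohen test, which is not reproduced here; it is a plain upper-lower difference in percent correct. The function names, the 27 percent group size, and the ten pupils' data are all assumptions made for illustration.

```python
def percent_correct(responses):
    """Percent of pupils answering an item correctly (responses are 0 or 1)."""
    return 100 * sum(responses) / len(responses)

def discrimination(item_correct, total_scores, fraction=0.27):
    """Percent correct in the high total-score group minus that in the low group."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, round(len(order) * fraction))  # group size is an assumed convention
    low = [item_correct[i] for i in order[:k]]
    high = [item_correct[i] for i in order[-k:]]
    return percent_correct(high) - percent_correct(low)

# Hypothetical data: one item's responses and total test scores for ten pupils.
item_correct = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
total_scores = [10, 12, 15, 18, 20, 22, 25, 27, 30, 33]

difficulty = percent_correct(item_correct)
index = discrimination(item_correct, total_scores)
near_fifty = abs(difficulty - 50) <= 10  # tolerance chosen only for this sketch
```

An item like this one, answered correctly by half the pupils and far more often by high scorers, is the kind the text describes as retained.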

The item analysis included: 


1. Samples of Recognition Items Accepted
and Rejected

Table I presents sample words illustrative
of the type of words accepted for the Recognition
Test.

A number of words lacked discriminating
power, and their average levels of difficulty
showed them to be either too easy or too hard;
these were rejected. Samples of these words are
listed in Table II.

It was found that from the initial 803
multi-meaning words used in the try-out recognition
test, 484 fulfilled the requirements on the
difficulty level, as the average centered on the 50
percent difficulty and ease level. The
discriminating level of .01 also established that
there was only one chance in a hundred the pupil
could guess the correct response on these words.
They also showed a gradual increase from grade to
grade, indicating the higher the grade, the more
success the pupils had on the words.


The next item analysis included: 


2. Samples of Identification Test Items 

Accepted and Rejected 

To illustrate the item analysis process on 321 
key words and 1984 meanings, sample items ac- 
cepted for the Identification Test are shown in 
Table III. 

Items like the above, with gradual levels of
difficulty for most of the meanings and good
discriminating powers, were retained in the final
forms. Frequently a meaning fulfilled the
requirements except that success at different grade
levels varied in range, or there were small minus
differences at higher grade levels. It could hardly
be expected that responses on so many meanings
would always show comparable increases; therefore,
meanings were accepted if they showed good
discriminating power, the majority of meanings for
the key word had gradual increases of success, and
the average difficulty and ease level centered on
50 percent.

Most of the words eliminated in the
identification test showed poor discriminating
power. Sample items showing this factor are
presented in Table IV.

After the item analysis, it was found that 170 
words were ineffective, and they were discarded. 
The number of multi-meaning words used in the 
final forms of the identification test was 151, and 
measured 690 meanings. 


3. Samples of Paragraph Test Items 

Accepted or Rejected 

An item analysis determined the paragraphs 
that were desirable from a standpoint of level of 
ease and difficulty and good discriminating power. 

A sample paragraph is presented to show the 
design of this test, and the number of meanings 
presented for identification: 


When the village bell rang out the wives
of the fishermen rushed to the shore to greet
the men. Their faces would ( ) beam as
they saw the sails in the distance. A ( )
beam of light would shine in the water, and
later they could make out the nets hanging
from a ( ) beam on the stern.


1. a radio  2. a heavy timber  3. show worry
4. smile radiantly  5. candlelight  6. heavy
cable  7. a ray of light


BERWICK


TABLE I

SAMPLE OF RECOGNITION TEST WORDS ACCEPTED ON THE BASIS OF
DISCRIMINATING POWER AND GRADUAL EASE OF DIFFICULTY

Grades   Average Percent of Difficulty   Discriminating Level


TABLE II

SAMPLES OF RECOGNITION TEST WORDS REJECTED ON THE BASIS OF
DISCRIMINATING LEVEL AND DIFFICULTY AND EASE LEVEL

Word   Grades   Average Percent of Difficulty   Discriminating Level

Latch
Lime
Quail
Wax
Harp
Elder
Ham
Organ
Yoke
Barrier


TABLE III

SAMPLES OF IDENTIFICATION TEST ITEMS ACCEPTED ON THE BASIS OF
DISCRIMINATING POWER AND GRADUAL EASE OF DIFFICULTY

Key Word (Palm)   Grades (%)   Average Percent of Difficulty   Discriminating Level

Meanings Tested
Tropical tree          61  70  72
Emblem of victory      24  33  44
Flat part of hand      57  68  81
Measure of length
To pass by fraud       16  29  34

Average difficulty of all meanings

Key Word (Crown)   Grades (%)   Average Percent of Difficulty   Discriminating Level

Meanings Tested
A royal headdress                      86
Topmost part of head                   55
British coin worth five shillings      18
Part of tooth                          27
To invest with regal power             24
To hit on head                         36
Crest of a bird                        27
To curve upward                        15

Average difficulty of all meanings accepted

* Very few key words had every meaning showing gradual gain, but discriminating power was
acceptable so it was included in the final test.

**Meaning excluded on the basis that it lacked discriminating power.


The item analysis determined the paragraphs 
that were desirable from a standpoint of level of 
ease and difficulty and good discriminating power. 
Table V shows the responses on the above
paragraph, which were of the type accepted for the
final forms of the test.

An illustration is given in Table VI of a para- 
graph that was rejected on the basis of poor dis- 
criminating power. 


As the flood rushed toward the old dam
the men worked desperately to ( ) bar it
from overflowing onto the highway. Sandbags
were used as a ( ) bar where the dam
was weak. Completely out of sight was the
( ) bar that usually reached out into the
channel.


1. a rushing sound  2. sand bank  3. prevent
4. danger signs  5. support  6. make
bare  7. barrier


Paragraphs like the above had to be discarded 
because the responses were influenced too strong- 
ly by the chance factor. 

The number of paragraphs in the preliminary 
context test was ninety-six. Sixty-one para- 
graphs fulfilled the requirements set up for items 
to be used in the final context tests. 


Test Developments 


After each item was analyzed, the final forms 
were made, using all the words that fulfilled the 
requirements postulated for good test items. 

In order to ensure balanced items in each form 
of the three tests, the words were grouped ac- 
cording to average levels of difficulty and ease. 
The items were arranged in blocks of five, and 
an average was taken after every fifth item to 
equalize each form. Thus, pupils who would 
not finish the test would always encounter similar 
items in any form of the tests at every fifth point. 
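One way to picture this balancing is sketched below. The text says only that words were grouped by difficulty and averaged in blocks of five; the back-and-forth dealing used here is a stand-in technique chosen for illustration, and the function names and the fifteen (word, difficulty) pairs are hypothetical.

```python
def assign_forms(items, n_forms=3):
    """Deal items, sorted by difficulty, to forms in a back-and-forth order."""
    forms = [[] for _ in range(n_forms)]
    ranked = sorted(items, key=lambda item: item[1])
    for start in range(0, len(ranked), n_forms):
        group = ranked[start:start + n_forms]
        if (start // n_forms) % 2:  # reverse every other round
            group = group[::-1]
        for form, item in zip(forms, group):
            form.append(item)
    return forms

def block_means(form, block=5):
    """Mean difficulty of each consecutive block of items in a form."""
    return [sum(d for _, d in form[i:i + block]) / len(form[i:i + block])
            for i in range(0, len(form), block)]

# Hypothetical (word, percent difficulty) pairs; 15 items give one block per form.
items = [(f"word{i}", 43 + i) for i in range(15)]
forms = assign_forms(items)
means = [block_means(f) for f in forms]
```

Checking the block means after every fifth item is the part that corresponds to the procedure in the text; any assignment rule that keeps those means close would serve.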

The words were placed as nearly as possible
in each form to measure the same number of
meanings. Occasionally this was impossible because
words that balanced in difficulty did not balance
in number of meanings at every fifth item. When
this occurred, they were balanced in the next block
of five items.

Table VII shows in schematic form the plan 
for assigning words to each form. 





Statistical Data on Population 


To secure data bearing upon the type of
population and the reliability and validity of the
constructed tests, the following tests were given
to 491 pupils in grades 4 to 8 in the public
schools at Dover and Nashua, between November 1 and
December 19, 1951.


1. California Short-Form Test of Mental Maturity,
Elementary S-Form, Grades 4-8 (1938).
2. Gates Reading Survey for Grades 3 (2nd Half)
to 10 (Vocabulary, Level of Comprehension,
Speed and Accuracy).
3. Multi-Meaning Vocabulary Tests (Recognition,
Identification, and Context), Form A.
4. Multi-Meaning Vocabulary Tests (Recognition,
Identification, and Context), Form B.
5. Multi-Meaning Vocabulary Tests (Recognition,
Identification, and Context), Form C.


Analysis of Data 


The sensitivity of any test as a measuring
instrument depends upon the ascending order of
means between grades. Every test should serve
to distinguish relative positions of groups as
regards the trait which is measured. Table VIII
illustrates the differences between the mean raw
scores and standard deviations of each grade
taking the recognition test.

The greatest apparent differences in mean
scores were between grade 4 and grade 5, an
average difference of 21 points; then grade 5 and
grade 6, an average difference of 10 points, and
grade 6 and grade 7, an average difference of
seven points. Grade 8’s mean was three points
lower than grade 7’s on Form A, and practically
equal to it on the other two forms.

Data showing the mean scores for grade 4 
through grade 8 obtained on the identification test 
are shown in Table IX. 

Whereas the mean scores for each grade show
an increase from one grade to the next (an average
difference of 5 points between grade 4 and grade 5,
of 21 points between grade 5 and grade 6, and of 13
between grade 6 and grade 7), the average difference
between grade 7 and grade 8 drops to 4.

Data showing the mean scores obtained on the
context test by pupils in grade 4 through grade 8
are presented in Table X.

A striking fact is the failure of grade 8 to
make more than slight gains on any of the tests.

The gains for the other grades were consistent 
and a steady increase was noted between mean 
scores of consecutive grade levels. 

The final forms of the tests were not administered,
but re-scoring was done on each of the
items accepted for the final tests from the try-out
tests. The advantages of re-scoring after tests 
have been refined are: 
TABLE VI


EQUALITY OF ITEMS FOR THE THREE FORMS OF THE TEST


            Average Percent    Discriminating    Number of
            of Difficulty      Level             Meanings


Form A

   Date
   Guard
   Order
   Match
   Grant


Totals:
Form A - Average Percent of Difficulty, 49; Discriminating
Level, .01; Number of Meanings, 23.
Form B - Average Percent of Difficulty, 49; Discriminating
Level, .01; Number of Meanings, 2.
Form C - Average Percent of Difficulty, 49; Discriminating
Level, .01; Number of Meanings, 22.


Note: Inspection of the table shows that the items were balanced in
difficulty and discriminating levels.
TABLE XII


CORRELATIONS BETWEEN SCORES ON THE CONSTRUCTED MULTI-MEANING
SUB-TESTS AND READING AGE FOR GRADES 4-6-8 PUPILS


                  No.
Factor   Grade   Cases   Recognition   Identification   Context

R.A.              97     .778 ± .04    .816 ± .03       .830 ± .03
R.A.                     .873 ± .02    .887 ± .02       .876 ± .02
R.A.                     .826 ± .03    .881 ± .02       .871 ± .02





TABLE XIII


CORRELATIONS BETWEEN SCORES ON THE CONSTRUCTED MULTI-MEANING
SUB-TESTS AND CHRONOLOGICAL AGE OF GRADES 4 - 6 - 8 PUPILS


                  No.
Factor   Grade   Cases   Recognition   Identification   Context

C.A.              97     -.257 ± .03   -.255 ± .05      -.277 ± .02
C.A.              99     -.117 ± .10   -.453 ± .08      -.430 ± .08
C.A.              98     -.115 ± .10   -.343 ± .09      -.441 ± .08
TABLE XIV


CORRELATIONS BETWEEN SCORES ON THE CONSTRUCTED MULTI-MEANING
SUB-TESTS AND GATES VOCABULARY TEST FOR GRADES 4 - 6 - 8


                            No.
Test               Grade   Cases   Recognition   Identification   Context

Gates Vocabulary                   .755 ± .04    .804 ± .04       .799 ± .04
Gates Vocabulary                   .876 ± .02    .883 ± .02       .881 ± .02
Gates Vocabulary                   .826 ± .03    .882 ± .02       .871 ± .03





TABLE XV 


CORRELATIONS BETWEEN SCORES ON THE CONSTRUCTED MULTI-MEANING
SUB-TESTS AND THE GATES COMPREHENSION TEST FOR GRADES 4 - 6 - 8 





Test 


No. 
Grade Cases Recognition Identification Context 





Gates 
Compre. 


Gates 
Compre. 


Gates 
Compre. 


.716 ± .05


.759 ± .04


.802 ± .04
TABLE XVI


CORRELATIONS BETWEEN THE THREE SUB-TESTS OF THE MULTI-
MEANING VOCABULARY TEST FOR GRADES 4 - 6 - 8


Tests I-II    Tests I-III   Tests II-III

.828 ± .03    .834 ± .03    .821 ± .03
.885 ± .02    .897 ± .02    .885 ± .02
.878 ± .02    .817 ± .03    .861 ± .03





TABLE XVII


CORRELATIONS BETWEEN SCORES ON THE THREE FORMS OF THE
MULTI-MEANING VOCABULARY TEST FOR GRADES 4 - 6 - 8


                Forms         Forms         Forms
Grade   No.     A-B           A-C           B-C

                .850 ± .03    .812 ± .04    .813 ± .04
                .831 ± .03    .898 ± .02    .822 ± .03
                .872 ± .03    .827 ± .03    .869 ± .03
1. The results were obtained from the same
population without having them take the test
a second time.

2. No complete form was given at the same
time, as the items had not been selected for
the final tests when the preliminary tests
were given.

3. All the tests were corrected a second time.


Validity of the Tests 


In determining the validity of any test, it is 
necessary to find the correlations between the cri- 
teria used. 


California (p. 110) 

Reading Age 

Chronological Age 

Gates Reading Survey for Grades 3 (2nd Half)
to 10

(Vocabulary, Level of Comprehension, 
Speed and Accuracy) 


Table XI gives the coefficients of correlation
found between mental age secured from the above
mental test and the scores on the sub-tests of the
Multi-Meaning Vocabulary Tests.

It is recognized that the criterion of mental 
age represents as satisfactory an independent in- 
dex of the ability necessary for vocabulary as can 
be secured. 

A second criterion was used to check the
validity of the multi-meaning tests. Reading
ages were secured from the results of the Gates
Reading Survey. Table XII shows the correlation
found between this factor and the constructed
tests.
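
The coefficients in these tables are product-moment correlations, each followed by a small value that appears to express sampling error. A minimal sketch of how such figures could be computed follows. The source does not state which error formula was used; the large-sample standard error (1 - r^2) / sqrt(N) is an assumption here, although it does give .04 for r = .778 with 97 cases, the first entry of Table XII. The function names are illustrative only.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 cases")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standard_error_of_r(r, n):
    """Large-sample standard error of r (assumed formula, not from the source)."""
    return (1.0 - r * r) / math.sqrt(n)


# Check against a figure reported in Table XII: r = .778, 97 cases.
print(round(standard_error_of_r(0.778, 97), 2))
```

The same two functions would apply unchanged to the sub-test and alternate-form correlations reported under reliability, since those are also correlations between paired scores.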

The correlation with reading age is high, since 
the task of identifying and knowing the definition 
of words is a component of reading. Table XIII 
presents coefficients of correlations between 
chronological age and the scores obtained on the 
sub-tests of the Multi-Meaning Vocabulary Tests. 

The correlations between chronological age 
and the multi-meaning vocabulary sub-tests in 
each instance were found to be negative. This
finding indicates that many of the oldest pupils in the
sample obtained the lowest scores and many of the 
youngest pupils had the highest scores. 

To secure additional data bearing upon the
validity of the constructed tests, correlations with
sub-tests of the Gates Reading Survey were found.
This survey is divided into three sub-tests, and
the vocabulary sub-test was correlated with the
multi-meaning vocabulary tests.

Table XIV presents data on the correlations 
that have been found. 

All correlations between the Gates Vocabulary
Test and the multi-meaning sub-tests are high.
This finding indicates that the pupils who made
high scores in the Gates Vocabulary Test made
high scores on the constructed tests while pupils
making low scores on the Gates Vocabulary Test
made low scores on the Multi-Meaning Vocabulary
Tests.

The total scores obtained on the Gates
Comprehension Test were correlated with the
Multi-Meaning Sub-tests. Table XV shows the
findings.

The correlations between the Gates Comprehension
Test and the Multi-Meaning Vocabulary Sub-Tests
are all high. It is significant that the two Gates
sub-tests correlate about equally well with the
constructed multi-meaning vocabulary tests, and
that the correlations are consistently high.


Reliability of the Tests 


Basic statistical procedures were necessary 
to determine the reliability of the constructed 
tests. Coefficients of reliability for the sub-tests 
were found. Table XVI shows the relationships 
on these three sub-tests. 

It was to be expected that pupils who obtained 
a high score on one of the sub-tests would be like- 
ly to secure a high score on a similar test which 
measured a like skill differing only in application 
of technique. 

The large number of multi-meaning words 
used in the preliminary tests made it possible 
after the test was refined to organize the material
into three forms. Each form included a recogni- 
tion, identification, and context test. In order to 
ascertain whether or not the pupils obtained a cor- 
responding score on each form of the test, coeffi- 
cients of reliability were found. Table XVII pre- 
sents a comparison of the correlations obtained 
on Forms A, B, and C of the Multi-Meaning Vo- 
cabulary Test. 

High correlations were found among all the 
forms of the Multi-Meaning Vocabulary Test. In 
view of the fact that each form was equally weighted
in difficulty, discriminating level, and number
of meanings, it is not surprising that such high co- 
efficients of reliability were found. 


Conclusions 


Basic statistics for the evaluation of this test 
were presented. The following facts are apparent: 


1. More words have been tested in approximately
the same time and space as in a single-meaning
vocabulary test.

2. The tests are highly valid. 

3. The different forms and the sub-tests show 
very high reliability. 

4. A more substantial inventory of pupils’ 
knowledge of words which have more than one 
meaning is secured. 
THE DEVELOPMENT OF THE EARLY SCHOOL
PERSONALITY QUESTIONNAIRE *

University of Arizona
University of Illinois


IN THE course of an extensive investigation of 
personality in middle childhood, a 200-item group 
questionnaire was constructed. It was adminis- 
tered to two samples of first- and second-grade 
children. One sample consisted of 151 children 
from the public schools of Mahomet and Rantoul, 
Illinois. The other consisted of 181 children in 
Decatur, Illinois. The item data from the two 
samples were subjected independently to factor 
analysis. Each study culminated in 18 obliquely 
rotated factors. This work was reported in two 
earlier articles (4,7). In one of these articles 
(7), the alignment of the two sets of factors was 
considered. When a combination of criteria was 
applied, it was found that there was reasonably 
clear matching for 11 factors from either study 
with a corresponding set of 11 factors from the 
other study. 


Construction of the ESPQ Scales 





It was decided that the eleven matching factor 
pairs represented the most stable, reproducible 
factors underlying the questionnaire data, and 
these were accordingly chosen for inclusion in a
new factored questionnaire test. In addition, a 
large-variance factor appearing in the later, but 
not the earlier, study was accepted for represent- 
ation in this test. This decision seemed justified 
by the fact that the later study utilized a larger
sample and yielded a closer approximation to 
simple structure. In addition, it was thought de- 
sirable to include a thirteenth factor of general 
intelligence. Because of the nature of the origin- 
al item battery, of course, such a factor had not 
appeared in either analysis. 

For each of the 12 personality factors, two 
six-item scales were constructed. Each scale 
met the usual criteria for factor-scale construc- 
tion: a) adequate representation of the given fac- 
tor in terms of item correlations with the given 
reference vector in both studies, b) suppression, 
or balancing out, of irrelevant factor variance, 
c) an equality of a and b keyed alternatives. 








For intelligence, two eight-item scales were 
constructed. To insure adequate breadth, each 
scale was assigned items of four different types: 
synonyms, reasoning, subordination, and super- 
ordination. These were designed to blend in form 
with the personality items, and they utilize the 
same two-alternative response form. In difficulty 
level, they were aimed at the six-to-eight-year 
level. In the course of standardization, data will 
also be gathered on a battery of alternative items 
so that any necessary item replacements can be 
made. 

Thirteen pairs of scales were thus constructed.
These were segregated into two questionnaire
forms (Forms A and B) for a test which has been
designated the Early School Personality
Questionnaire (ESPQ). Each form contains 12 six-item
personality scales and one eight-item intelligence
scale, making a total of 80 items. The
correspondence of the factor scales, considered in final
numerical order, to the factors in the two studies is
shown in the following list (where "D" refers to the
study of Decatur children and "MR" to the earlier
study based on data from Mahomet and Rantoul):


MR 2 (-) 
MR 4 
MR 3 
MR 1 


MR 5 

MR 6 

MR 15 

. D9, MR 10 (-) 
. D10, MR 11 (-)
. D13, MR 7 (-) 
. D15, MR 14 

. Intelligence 


Because of the brevity of the scales, it is as- 
sumed that neither form will be used in isolation, 
except in rare circumstances. The division into 
two forms, however, should facilitate two-session 
administration. Each form will require 30 or 40 


*The research reported in this article was supported by a grant from the Department of Public Welfare 
of the State of Illinois. 
minutes of testing time. Administration of the
two forms at a single session would be advisable 
only with older subjects who can accept a fairly 
rapid pace. 

An answer form and a key have been specially
designed to permit easy and rapid scoring. For
each form, there is a four-page answer form.
Each page contains twenty rectangular boxes
arranged in two columns. In the middle of each box
is an item number, as well as a picture which can
serve the same purpose for younger subjects.
There is an A at the left end of each box and a B
at the right end.

The standard instructions are an adaptation of 
those employed in the original researches. For 
each item, the examiner first directs the class to 
the appropriate box and then reads the item. Marking
instructions are given to identify each alternative
with the A or the B. The children are
asked to indicate their responses by drawing a line 
through either the A or the B in each box. 

To give the reader some idea of the content of 
each scale, we shall list the questions utilized for 
each factor scale. The keyed alternative corres- 
ponding to the high end of the scale is underlined. 
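
The scoring procedure implied by the answer form and key can be sketched as follows. The item numbers are those listed for Factor 1, Form A; the key itself is hypothetical, since the underlining that marks the keyed alternatives does not survive in this copy, and one point per keyed response is an assumed scoring rule rather than one stated in the text. The balance check reflects construction criterion c), an equality of a and b keyed alternatives.

```python
# Item numbers for Factor 1, Form A, as listed in the text.
FACTOR_1_FORM_A_ITEMS = (1, 2, 11, 21, 31, 32)

# HYPOTHETICAL key for illustration only; the real keyed alternatives
# are not recoverable from this copy of the article.
HYPOTHETICAL_KEY = {1: "a", 2: "b", 11: "b", 21: "a", 31: "b", 32: "a"}


def score_scale(responses, key):
    """Count the items answered in the keyed direction (assumed rule)."""
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


def key_is_balanced(key):
    """True when as many items are keyed 'a' as are keyed 'b'."""
    a_count = sum(1 for keyed in key.values() if keyed == "a")
    return a_count == len(key) - a_count


# One child's marks on the six Factor 1 boxes (made-up data).
child = {1: "a", 2: "a", 11: "b", 21: "b", 31: "b", 32: "a"}
print(score_scale(child, HYPOTHETICAL_KEY), key_is_balanced(HYPOTHETICAL_KEY))
```

A total score for a factor would then be the sum of the Form A and Form B scale scores, in line with the assumption that neither form is used in isolation.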


Factor 1 


Form A 


1. Would you rather play: a) school, or b) cowboys
and Indians?
2. Which would you rather be: a) a doctor, or b)
a teacher?
11. Would you rather: a) fly an airplane, or b) be
a teacher?
21. Do you like to talk to teachers? a) yes, or b)
no.
31. Do you ever talk back to your mother? a) yes,
or b) no.
32. Would you rather: a) go to a party, or b) stay
home and play?




Form B 


1. If you were in a play, would you rather be: a)
a teacher, or b) a hunter?
2. If another child has your coat, do you: a) take
it away from him, or b) tell the teacher?
11. a) Do you like to see other children cry, or b)
does it make you sad?
21. Would you rather: a) be in a play, or b) make
something out of wood?
31. Do you ever feel like running away from home?
a) yes, or b) no.
32. Would you rather have: a) a new baby come to
live with you, or b) a little dog come to live
with you?

















Factor 2 


Form A 


3. Do you shiver when you hear a squeaky door
or chalk scraping on the blackboard? a) Yes,
or b) no.

12. Do you like: a) all the things your mother
gives you to eat, or b) only some of them?

13. When your mother tells you you can’t do
something, do you want to do it even more? a) Yes,
or b) no.

22. Does it bother you if people say you don’t
listen? a) Yes, or b) no.

23. When you are told to do something or put
something away: a) do you always do it right away,
or b) do you sometimes forget what you are
supposed to do?

33. a) Do you have a lot of fun, or b) do things
sometimes go wrong?














Form B 


3. Do grown-ups ever say you daydream too 
much? a) Yes, or b) no. 

12. Would you rather: a) play by yourself, or b) 
play with other boys and girls? 

13. Do you ever feel like crying when you see
something sad in the movies? a) Yes, or b)
no.

22. Do people ever call you naughty and mischie- 
vous? a) Yes, or b) no. 

23. a) Are other children always nice to you, or 
b) do they sometimes pick on you? 

33. a) Are you happy all the time, or b) do you 
sometimes get sad? 











Factor 3 


Form A 


4. a) Are your dreams usually nice, or b) do they
scare you?

5. Would you rather: a) go on a long trip in the
car, or b) go to school?

14. When your friends want to play a different
game, do you: a) play their game, or b) go
on with your game?

24. When you lose a book, do you: a) cry, or b)
just laugh?

34. Which do you like better: a) funny books, or
b) the books you have in school?

35. Would you rather tell your mother and father:
a) about school, or b) about a game you played
with your friends?











Form B 


4. Do you like to play: a) hard games that you
can win, or b) easy games where nobody wins?

5. Would you rather: a) work at home, or b) go
to school?

14. Do people ever say you talk too much or call
you a chatterbox? a) Yes, or b) no.

24. Do you think: a) school is hard, or b) school
is easy?

34. Would you rather: a) watch a game, or b)
learn something in school?

35. a) Do you think grown-ups should listen to you
and help you more than they do, or b) do you
like it better when they just leave you alone?











Factor 4 


Form A 


6. Would you rather have: a) a friend who can
read well, or b) a friend who is good at ball
games?

15. Would you rather: a) climb a tree, or b) look
at a book?

16. Do you think: a) everybody likes you, or b)
only some people?

25. Would you rather look at: a) comic strips that
are funny, or b) comic strips with a lot of
fighting and shooting in them?

26. Do people ever say you’re stuck-up? a) Yes,
or b) no.

36. a) Can you touch a big bug, or b) are you afraid
to touch bugs?





Form B 


6. Would you rather: a) color a book, or b) climb
a tree?

15. Would you rather: a) hunt for birds, or b)
draw pictures of birds?

16. Would you rather: a) watch people dancing, or
b) hear a story about airplanes?

25. Would you rather: a) talk to a friend, or b)
look at funny books?

26. Which do you like better: a) home, or b)
school?

36. Do you like to climb trees? a) Yes, or b) no. 











Factor 5 


Form A 


7. Would you rather: a) talk to your friends, or 
b) talk to your teachers? 

8. Would you rather: a) stay at home, or b) go 
shopping with your mother? 

17. a) Do you like to talk to your teacher, or b) 
are you sometimes a little afraid to? 

27. When your friends start fighting: a) do you 
just leave them alone, or b) do you tell them 
to stop it? 

37. When you get angry: a) do you shout, or b) 
just cry? 

38. Are you as good-looking as the other children
in your class? a) Yes, or b) no.


Form B 


7. Would you rather: a) play a noisy game where
you pretend to be wild animals, or b) listen
to a story read by your teacher?

8. When somebody bawls you out: a) do you cry,
or b) do you get mad?

17. When your mother and father are busy, do you
like to help them? a) Yes, or b) no.

27. If you had to make your bed: a) would you
listen to the radio first and then make it, or b)
would you make it right away?

37. a) Do you like to talk to your father, or b)
would you rather not talk to him?

38. Do you like to cross busy streets? a) Yes, or
b) no.


Factor 6 


Form A 


9. a) Do you like for other children to play with
your toys, or b) does it sometimes bother
you?

18. Do you cry: a) more than other boys and girls,
or b) less?

19. Do people boss you around too much? a) Yes,
or b) no.

28. a) Are you always very careful how you move,
or b) do you sometimes rush around when you
play and knock things over?

29. Would you rather be: a) a grown-up, or b) a
baby?

39. Do your friends sometimes say: a) bad things
about you, or b) only true things?











Form B 


9. If you get upset or sad: a) do you get happy
again pretty soon, or b) do you stay sad for a
long time?

18. When you get hurt, do you: a) cry, or b) try
to keep from crying?

19. Do people sometimes punish you when you
haven’t done anything wrong? a) Yes, or b)
no.

28. Do you like to play: a) new games, or b) just
games you already know?

29. When you get hurt: a) do you just try to forget
about it, or b) do you sometimes cry?

39. Is it sometimes hard to get people to
understand what you’re saying? a) Yes, or b) no.











Factor 7 


Form A 


41. Would you rather: a) listen to a story, or b)
watch two dogs fight?

42. Who usually has better ideas: a) you, or b)
your friends? 

51. Are you pretty good at: a) everything, or b) 
just a few things? 

61. a) Are you getting along pretty well, or b) do 
you have a lot of problems and troubles? 

71. Do you like to tell stories to other people? 
a) Yes, or b) no. 

72. Do you usually: a) do what other people say, 
or b) do what you want to do? 








Form B 


41. Which do you like better: a) hearing stories 
about what a boy or girl does, or b) really 
doing things yourself? 

42. a) Does your mother let you do almost anything
you want, or b) are there lots of things she
won’t let you do?

51. a) Are you stronger than other children, or b)
are they stronger than you?

61. When you argue with people: a) do you
sometimes find out that you were wrong, or b) are
you nearly always right?

71. When you want to say something: a) do you
just say it, or b) do you think it over first?

72. a) Do you have to do things you don’t want to
do, or b) do you always do what you want to?

















Factor 8 


Form A 


43. Do you like: a) to play with dogs, or b) to 
stay away from them? 

52. Would you rather: a) go to a party, or b) listen
to the radio and TV?

53. Do you get into: a) more trouble than most
other children, or b) less trouble?

62. a) Do you like to do things the way your mother
says, or b) is your own way better?

63. When children play tricks on you: a) do you
cry, or b) do you get mad?

73. Which do you like better: a) cats, or b) dogs?




















Form B 


43. Which would you rather be: a) a dog, or b) a
cat?

52. a) Are you always neat and tidy, or b) are you
sometimes careless and messy?

53. Which would you like better: a) a story about
fighting Indians, or b) a story about how
Indians made clothing?

62. Would you rather: a) run, or b) sit still?

63. Would you rather do something for: a) your
mother, or b) your father?

73. a) Do you have days when everything goes
wrong, or b) are you happy all the time?














Factor 9


Form A


44. Does anyone ever call you a cry-baby? a) Yes,
or b) no.

45. a) Are you always pretty lucky, or b) do more
bad things happen to you than to other children?

54. Is your teacher: a) nicer to other children,
or b) just as nice to you?

64. On the playground: a) do you make a lot of
noise, or b) are you mostly quiet?

74. Would you rather: a) build things with your
friends, or b) build things by yourself?

75. Do you think people ever say bad things about 
you behind your back? a) Yes, or b) no. 














Form B 


44. Do you know any children who are so dumb
that it’s no fun to play with them? a) Yes, or
b) no.

45. Do you like to: a) run, or b) just walk?

54. On the playground: a) do you play by yourself,
or b) mostly with other children?

64. a) Do you have a lot of friends, or b) just a
few friends?

74. On the playground: a) do you run most of the
time, or b) stand still a lot?

75. Do you like to: a) help your mother and father
with things, or b) just play?


Factor 10 











Form A 


46. Would you rather go on a trip with: a) your
father, or b) your mother?

55. If you wake up in the dark: a) do you
sometimes feel scared, or b) do you like it because
it’s so dark and quiet?

56. Would you rather: a) play a noisy game, or 
b) look at a book by yourself? 

65. When people are talking about a movie that you 
have seen: a) do you want to tell it your way, 
or b) just listen to them? 

66. When your mother is angry: a) do you feel 
like crying, or b) do you feel happy anyway? 

76. If you were up on a big rock: a) would you be 
scared, or b) would you just laugh? 




















Form B 


46. Would you rather play with: a) older children,
or b) younger children?

55. Which would you like better: a) to hear stories
about bears, or b) to have bears here right
now?

56. Would you rather talk to: a) your father, or b)
your mother?

65. When you go to bed: a) do you lie awake for a
long time, or b) do you go to sleep right away?

66. When you start to say something: a) do
grown-ups always listen to you, or b) do they do all
the talking?

76. Do you sometimes feel a little scared when
you’re up on a high place? a) Yes, or b) no.





Factor 11


Form A


47. When you get a new toy: a) do you let other
children play with it, or b) do you watch out
that no one breaks it?

48. Does your mother ever go away for a little
while and leave you at home all by yourself?
a) Yes, or b) no.

57. Do you like: a) a friend who talks a lot, or b)
one who is quiet?

67. Do you ever take your toys to bed with you?
a) Yes, or b) no.

77. Would you rather play: a) in your own yard,
or b) in somebody else’s?

78. Are you ticklish? a) Yes, or b) no.


Form B


47. Can you do things: a) better than most boys
and girls, or b) not as well as most boys and
girls?

48. a) Do people sometimes bawl you out when you
haven’t done anything wrong, or b) do they
always treat you the way they should?

57. a) Does someone wake you up in the morning,
or b) do you wake up by yourself?

67. Do you ever talk in your sleep and wake
yourself up? a) Yes, or b) no.

77. a) Do you wish you had more friends to play
with, or b) do you have enough friends?

78. When your mother and father say it’s time for
bed: a) do you like to go to bed, or b) do you
want to stay up longer?


Factor 12


Form A


49. Do you like: a) to meet new boys and girls,
or b) to be with children you already know?

58. Do you like: a) to tell other children what to
do, or b) do what other children want to do?

59. Do you ever do things that you should not do?
a) Yes, or b) no.

68. Do you like to play games: a) with a lot of
children, or b) just one or two children that
you know?

69. a) Can you remember stories, or b) do you
forget them very soon?

79. Would you rather: a) make something with
blocks or bricks, or b) play with other
children?


Form B


49. a) Do you make your bed in the morning, or b)
does your mother make your bed?

58. When you get angry: a) do you sometimes yell
and stamp your feet, or b) do you just try to
forget about it?

59. Would you rather: a) look at a picture book
by yourself, or b) look at it with another boy
or girl?

68. When children fight, do you think grown-ups
sometimes punish the wrong one? a) Yes, or
b) no.

69. Do you: a) usually put your clothes away at
night, or b) just leave them anywhere?

79. If people don’t want to do the same things as
you: a) does it make you mad, or b) do you
just do what they want to do?


Interpretation of Factors


In our earlier reports, we offered some gener- 
al interpretations of questionnaire factors at the 
middle-childhood level. We have refrained, how- 
ever, from any extensive interpretation on the ba- 
sis of item content alone. But we have now as- 
sembled personality scales for twelve factors on 
a purely statistical basis. We can offer sound 
mathematical arguments for the utility of these 
scales, for they have been designed to cover a 
broad area of personality with maximal economy. 
The actual utility of these scales, on the other 
hand, now depends on our understanding of their 
psychological referents. 

Apart from manifest item content, there are 
two chief avenues by which we can hope to gain 
some insight into the factors that we are measuring:
a) we can determine the way in which
questionnaire scores covary with other kinds of
personality measures, and b) we can link this
questionnaire instrument with those developed for other
age levels by using a sample at an intermediate 
age level. The first type of comparison can be 
conveniently made, since we also have teacher- 
rating, parent-rating, and objective-test data for 
our Decatur sample. The analyses of these data 
have been reported elsewhere (1,3,5). In another 
article (2), we shall consider interrelations among 
factor scores derived from all of our various 
media of measurement and describe a common 
factor analysis of these scores. 

For the other type of comparison, it is convenient
to relate the ESPQ to the High School Personality
Questionnaire (HSPQ). The latter test, which
is described elsewhere (6), was devised for the
12-to-16-year range. It contains 14 scales which 
have been identified in terms of factors that have 
appeared in studies of adults and pre-adolescents.
Our plan was to give the ESPQ and the HSPQ to a 
sample at about the nine- and ten-year level. It 
must be recognized, of course, that some of the 
ESPQ will seem a bit ‘‘childish’’ to such a group, 
while portions of the HSPQ will be incomprehen- 
sible to them. Some resultant distortion of cor- 
relations between the two tests is to be expected. 
No other procedure, however, permits such a di- 
rect comparison of questionnaire factors at the 
two age levels from which the two tests were de- 
rived. 

We can report findings for two independent 
samples.* The first sample consisted of two 
fourth-grade and two fifth-grade classes in Terre 
Haute, Indiana. There were 46 boys and 59 girls, 
making a total of 105. In age, they ranged from 
nine years and five months to eleven years and 
eleven months, with a mean of ten years and five 
months. These children received all items of the
ESPQ, except for factor 13, at one session and
Form A of the HSPQ at another session. The
product-moment correlations between the ESPQ
total scores and Form A scores for the HSPQ are
shown in Table I.

A second sample was secured in Kent, Ohio. It 
consisted of one third-grade class and three fourth- 
grade classes. There were 43 boys and 49 girls, 
or a total of 92 children, ranging in age from eight 
years and five months to ten years and eleven 
months, with a mean of nine years and nine months.
For this sample, we were able to obtain total 
scores, based on both forms, for all scales of 
each test. The lower age level of the Kent sam- 
ple is conducive to some loss of reliability and 
validity for the HSPQ. Any such loss, however,
was probably offset by the increased stability af- 
forded by the use of both forms of this test. In 
fact, we might justifiably attach more weight to 
the Kent data than to the Terre Haute data. The 
product-moment correlations between ESPQ and 
HSPQ scores for the Kent sample are shown in 
Table II. 

In each of our tables, there are several rows 
and columns in which more than one substantial 
correlation can be seen. This suggests the definite
possibility of age changes in personality structure.
At either age level (six-to-seven or twelve-
to-sixteen), there could be factors intermediate 
between factors appearing at the other level. There 
could be processes of factor convergence, where- 
by eleven-year factors are discernible at the six- 
year level only in the second-order realm, or of 
factor divergence, whereby six-year factors become
second-order at the eleven-year level. Our
present data, however, do not provide a sufficient
basis for drawing any such conclusions. The
picture is complicated by the approximate character
of factor scores estimated by the present methods
and by the fact that, at every age level thus far 
studied, there are some factors that inevitably 
tend to ‘‘cooperate’’, or share high-loading vari- 
ables. In the construction of item scales for co- 
operative factors, a complete suppression of ir- 
relevant factor variance is often not possible. For 
this and other reasons, the intercorrelations with- 
in one set of factor scales will not faithfully rep- 
resent the actual intercorrelations among the fac- 
tors. Some less severe distortion of correlations 
with outside measures is to be expected. While 
possibilities for developmental shifts in factor 
structure thus exist, they must be examined in the 
light of other evidence than that presented here. 

In any case, some of the ESPQ factors can be 
identified with some certainty if both the item con- 
tent and the HSPQ correlations are carefully 
weighed together. Any given correlation, of 
course, must be evaluated in the light of all other 
correlations obtained for its component variables. 

Factor 1 of the ESPQ shows some affinity with 
factors A, E, and I of the HSPQ. Since E and I 
seem to be represented elsewhere, factor A looms 
as the best match. Such an interpretation appears 
to be consistent with the content of the items. The 
items, however, stress primarily a form of 
socialization incidental to adjustment in a school 
environment. This suggests something akin to 
factor K, or Socialized Morale, which has appeared 
only in rating data in previous studies of older 
groups. In its present form, then, factor 1 may 
be identified as A- (Schizothymia vs. Cyclothymia) 
or as K- (Dislike of Education vs. Socialized 
Morale). 

In factor 2, we see a pattern of irritability and 
subjective distress, which suggests O (Anxious 
Depression) or Q4 (Nervous Tension). At the 
same time, there is much in the item content that 
suggests the pattern of overreaction and impulsive- 
ness of factor D (Excitability or Infantile Emotion- 
ality). This elusive factor does not reveal itself 
clearly in questionnaire research with older groups, 
but present findings are consistent with Excitabil- 
ity as an interpretation of our present factor. 

Like factor 1, factor 3 seems to involve attitudes 
and adjustment with respect to school. More 
generally, it seems to relate to an acceptance of 
adult authority and a successful incorporation of 
adult standards of conduct. The correlations are 
consistent with an interpretation of this factor as 
G (Superego Strength). 

Factor 4 seems to combine emotional sensitivity, 
timidity, and aesthetic interest in a way 
characteristic of factor I (Premsia or Emotional 
Sensitivity), but the opposing content is strongly 


*For the procurement of these samples, the authors are indebted to Dr. Rutherford B. Porter and Dr. 
Edna R. Oswalt and to the cooperating school personnel of Terre Haute, Indiana, and Kent, Ohio. 
suggestive of factor E (Ascendance vs. 
Submissiveness). The correlations support 
interpretations in terms of factor A, factor E, and factor I. 
Factors A and I are probably both better represented 
by other factors in the present set. Further 
evidence may clearly identify the present factor 
as E- (Submissiveness vs. Ascendance). 

The most prominent feature of factor 5 seems 
to be a certain dependence on parents and teach- 
er, with a conformity to their wishes. The cor- 
relations are generally consistent with this con- 
tent, but they do not point to any single clear 
match with a recognized factor. The two matrices 
agree in relating this factor to A, D-, E-, G, and 
H, but each of these seems to be better represent- 
ed by another factor in the set. 

Factor 6 shows emotional control and tolerance 
of frustration at the high end and a more ready 
loss of composure at the low end. The correla- 
tions most strongly suggest identification of this 
factor as D- (Phlegmatic Frustration Tolerance 
vs. Infantile Emotionality or Excitability), but the 
content is also quite consistent with O- 
(Confidence vs. Anxious Depression). 

At the high pole of factor 7, we see an aggres- 
sive self-assertion which is consistent with the 
pattern of factor E (Ascendance vs. Submissiveness) 
at lower age levels. Another possible 
interpretation is F (Surgency vs. Desurgency), 
which, unlike E, is apparently not represented 
elsewhere in the present set of factors. Factor 8 
has no unambiguous match, though the correla- 
tions indicate some relationship to A, E, G, I, J, 
and Q3. The prevailing theme running through 
the items of this factor might be characterized as 
‘‘motor expressiveness vs. socialized restraint.’’ 

In the present correlation data, factor 9 aligns 
best with C. The item content accords generally 
with such a match, though it is a little more 
strongly suggestive of A or H. The high scorer 
appears to combine certain schizothyme qualities 
of quietness, seclusiveness, and sullenness. A 
and H are supported by the correlations, though 
less strongly than C. Since A appears more clearly 
elsewhere, factor 9 may be best identified as 
H- (Withdrawn Schizothymia vs. Adventurous 
Cyclothymia) or C- (Poor Integration vs. Ego 
Strength). 

Factor 10 correlates best with I, and the content 
is clearly consistent with this identification. 
At its positive pole, there is a certain 
boisterousness and toughness. At the negative pole, we find 
fearfulness and timidity. We may tentatively 
identify this factor as I- (Rigid, Tough Poise vs. 
Sensitive Emotionality). 

Factor 11 shows no marked relationship to any 
HSPQ factor, and no clear overall pattern is 
readily discernible in the items. It seems best 
to refrain from interpretation until further data 
are available. Factor 12 presents a pattern of 
friendly sociability and calmness at the high end. 
The aggressiveness at the low end of the scale 
accounts for a high negative association with E, but 
the overall pattern at the high end is not that of 
submissiveness. It is more nearly one of 
cyclothymia. Both A and H, however, appear to be 
represented elsewhere. Q4 (Low Nervous Tension 
vs. High Nervous Tension) is another possible 
interpretation. 

Factor 13 was not used in the Terre Haute study. 
In the Kent study, it correlates appreciably only 
with factor B (Intelligence). This unambiguous 
finding actually exceeds expectations, for a 
correlation even as high as that obtained (+.32) is 
surprising in view of the fact that the ESPQ and HSPQ 
intelligence scales are severely restricted at 
opposite ends of the range in our sample. At the 
intermediate age level of the Kent sample, the ESPQ 
fails to discriminate at the high end of the range, 
while the HSPQ fails to discriminate at the low end. 


Summary 


Two equivalent forms were constructed for a 
factored personality questionnaire suitable for 
group administration to children in the six to eight 
year range. Each form contains 12 personality 
scales derived from two earlier factor-analytic 
studies. A thirteenth scale for intelligence was 
added to each form. Scores on the questionnaire 
were correlated with those for a questionnaire 
designed for older children for two independent 
samples. Inspection of the correlation data and of the 
items themselves permits satisfactory identifica- 
tion of most of the factors. There are indications 
that the test taps each of the following known fac- 
tors: 


A. Cyclothymia vs. Schizothymia 

B. General Intelligence 

C. Ego Strength vs. Poor Integration 

D. Infantile Emotionality 

E. Ascendance vs. Submissiveness 

F. Surgency vs. Desurgency 

G. Superego Strength vs. Dependent Immaturity 

H. Adventurous Cyclothymia vs. Withdrawn Schizothymia 

I. Emotional Sensitivity vs. Toughness 

K. Socialized Morale vs. Dislike of Education 

O. Anxious Depression 

Q4. Nervous Tension 


In general, it seems inadvisable to accept the 
factor identifications which we have indicated as 
final. There are some departures from the fac- 
tors found among adults that accord with other 
findings on the composition of known factors in 
childhood. There are a few factors that do not 
align clearly with any known factors. Further re- 
finement of the scales should reveal whether these 
represent real structural novelties at the 
middle-childhood level or simply poor approximations to 
identifiable factors. A study dealing with the re- 
lationships among questionnaire, rating, and ob- 
jective-test factors (2) may provide further clar- 
ification. 


Problem 


NON-VERBAL group tests have received wide 
attention in psychological testing programs al- 
though their use in this country has been rather 
limited. The most extensive use of the non-verb- 
al group test has been in the military services 
where the problem arises of processing and clas- 
sifying great numbers of illiterates and linguisti- 
cally-handicapped persons. Basically these non- 
verbal tests have been of the general classifica- 
tion and academic aptitude type and have been 
used primarily for screening and classificatory 
purposes. Their development and use has 
stemmed from the premise that personnel with 
linguistic and reading handicaps are penalized on 
a test score basis on the attribute ostensibly 
being measured. However, the development of these 
tests has been seriously handicapped by the diffi- 
culty of producing completely non-verbal items. 
Except for the Army Beta Examination (8) and the 
Semantic Test of Intelligence (5) in which the di- 
rections are given completely in pantomime, 
most so-called non-verbal tests are merely 
partly non-verbal tests. Furthermore, they are 
difficult to administer, expensive to produce, and 
the individual items in these tests are generally 
too easy for personnel of superior ability. Prob- 
ably for these and other apparent reasons the use 
of non-verbal group tests has been quite limited. 
In any event, even their limited use has been fur- 
ther restricted to the area of classification test- 
ing and these tests have not found application in 
the field of achievement testing where the per- 
formance test generally replaces the paper-and- 
pencil test when the latter is considered inappro- 
priate. 








In the U. S. Navy, a system of competitive 
service-wide examinations for the selection and 
advancement of petty officers has been established. 
All candidates for petty officer rates in the 60-odd 
naval enlisted occupations who have met minimum 
requirements and who have been recommended by 
their superior officers for advancement to the next 
higher pay grade level compete on these examina- 
tions which are administered semi-annually 
throughout the naval service. These examinations 
consist of 150 multiple-choice items designed to 
sample the professional and military knowledges 
and skills required for successful job performance 
in the rate being sought. They are further 
designed as highly discriminating selection tests, 
since only those candidates are allowed to compete 
who have been selected by their commanding officers 
and who have been certified by them, principally 
on the basis of training and on-the-job 
performance, to be at least minimally qualified for 
advancement. 

In order to provide for a universally fair and 
equitable system, all candidates for advancement 
are required to compete with their contemporar- 
ies on the identical standard. The principal com- 
ponent of this selection criterion is the standard- 
ized paper-and-pencil achievement type test de- 
scribed above. 

Among the 60-odd occupations which the U. S. 
Navy maintains in its petty officer structure at 
four pay grade levels, there is considerable 
variation both between and within groups in verbal and 
linguistic ability. In the more highly specialized 
and technical ratings such as Electronics Techni- 
cian, Air Controlman, and Aerographer, although 
there is reasonably high variance in this attribute 
within the groups, the overall ability is, on the 


* The opinions and assertions expressed in this paper are solely those of the authors, and do not neces- 
sarily reflect the official views of the Navy Department. 


**Complete address: Walter V. Clarke Associates, 324 Waterman Ave., East Providence 14, R. I. 
average, generally high and does not appear to 
constitute much of a problem in terms of its 
contribution to error variance on the advancement 
test score variable. However, in the less techni- 
cal service ratings such as Steward’s Mate, Ships 
Serviceman, and Commissaryman, the relatively 
high variance coupled with a generally low overall 
verbal ability has led to the formulation of the hy- 
pothesis that completely verbal items in the ad- 
vancement tests tend to reduce the validity of the 
measurement of job proficiency for members of 
these low verbal ability groups. The study re- 
ported in this paper was designed to test this hy- 
pothesis. 


Procedures and Results 





It was decided to conduct this research in the 
rating of Steward’s Mate at the level of third class 
petty officer candidates since the Steward’s Branch 
in the Navy represents the lowest level of verbal 
ability. In addition to the generally low verbal 
ability of personnel in this rating, there exists the 
additional feature of linguistic-handicap (bilingu- 
alism) caused by the employment of great numbers 
of native-born and native-educated Filipinos. 

For the purposes of this study verbal ability is 
defined and measured by performance on the Gen- 
eral Classification Test of the Navy Basic Test 
Battery. Factor analysis (9) and correlational (7) 
studies have shown this test to possess an 
exceptionally high verbal factor loading. Scores on this 
test are reported as Navy Standard Scores (X̄ = 50; 
σ = 10). From samples drawn from among 
candidates for advancement to petty officer, third 
class in 52 naval occupations in February 1955, 
Steward’s Mates proved to represent the 
significantly lowest level on verbal ability (X̄ = 35.70; 
σ = 8.40). The next higher level was reached by 
Gunner’s Mates (X̄ = 40.47; σ = 7.77) (C.R. = 5.58). 


A team composed of a subject-matter specialist 
(Chief Steward’s Mate employed at the Naval 
Examining Center as item writer for Steward’s 
Branch Examinations), a professional test 
technician (civilian educational specialist normally 
assigned to the task of reviewing and editing items 
in Steward’s Branch Examinations), and a 
commissioned officer of the Supply Corps who normally 
reviews items for Steward’s Branch Examinations 
from the standpoint of technical accuracy 
worked on the project of developing the 
experimental examination forms. 

In developing these test forms, items were 
either drawn from item card files or were con- 
structed as original items. The criterion stand- 
ards which items were required to meet in order 
to be considered acceptable for inclusion in this 
examination were: 

1) They must have sampled the petty officer 
requirements (1) of Steward’s Mates, particularly 
at the third class level; 

2) for items drawn from the files, they must have 
been of appropriate difficulty level (.10 < p < .90) 
and, also, of positive and significant 
discrimination (D > +0.4) (3); and 

3) the material upon which each item was to be 
based must have been of such a nature that it would 
lend itself equally well to presentation both in 
completely verbal form and in picture form 
(illustration item). 
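
As a rough illustration of the second criterion, the screening of file items might be sketched as follows. Only the two thresholds (.10 < p < .90 and D > +0.4) come from the text; the class name, the function names, and the three sample items are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    p: float  # difficulty level: proportion of examinees answering correctly
    d: float  # discrimination index


def is_acceptable(item, p_low=0.10, p_high=0.90, d_min=0.4):
    """Apply the difficulty and discrimination thresholds quoted in the text."""
    return p_low < item.p < p_high and item.d > d_min


def screen(items):
    """Return only the items that satisfy both thresholds."""
    return [item for item in items if is_acceptable(item)]


# Hypothetical item statistics, for illustration only.
pool = [
    Item("a", p=0.55, d=0.62),  # passes both thresholds
    Item("b", p=0.95, d=0.50),  # too easy
    Item("c", p=0.40, d=0.10),  # discriminates too weakly
]
kept = [item.item_id for item in screen(pool)]
print(kept)
```

The third criterion (suitability for both verbal and picture presentation) is a judgment made by the item writers and is not something a filter of this kind can capture.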

Accordingly, a sufficient number of items were 
either selected or constructed to meet all three of 
the above criteria. Each item was developed in 
two forms: 1) completely verbal form—verbally 
presented stems and options (four choices), and 
2) illustration form—verbal stem with options 
presented in picture form. In both forms the use 
of language was kept to a minimum, and to the ex- 
tent possible the words employed were kept in the 
vernacular. 

Two test forms, each consisting of the same 
fifty items, were developed. In form II, the form 
composed of the picture items, the order of 
presentation was the reverse of that of form I (verbal 
form). In all other respects, the items were 
identical, except for the mode of presentation. 
The same explicit and careful directions and 
general instructions were presented on the cover page 
of both forms, and both were reproduced by com- 
mercial offset printing methods. In the printing 
of the final forms, however, the illustrated form 
was reproduced about twice the size of the verbal 
form in an attempt to minimize any confusion 
which might arise on the part of the subjects to be 
tested by the lack of detail in the pictures which 
sometimes results from the reduction in size from 
original to final form by this method of printing. 

The experimental tests were administered in 
September 1955, to a total of 111 Stewardsmen 
representing nine different fleet units and stations 
who were assembled at the Naval Air Station, North 
Island, San Diego, California. Prior to 
administering the tests, the group of 111 subjects was 
divided into two subdivisions (N1 = 67 and N2 = 44). 
Sub-group one was administered the verbal test 
form first and the illustrated test form second. 
For sub-group two, the above procedure was 
reversed. This process was employed in order to 
minimize the influence of ‘‘practice effect.’’ 
Originally, it was planned to administer the forms 
alternately to the entire group, but physical 
facilities and other factors made it necessary to divide 
the groups in this manner and in these numbers, 
and to administer the tests in two separate testing 
areas. 


Analysis of the Data with Interpretations 





Total Sample— For the 111 subjects, means and 
standard deviations were calculated for scores 
obtained on each test form; and then, employing the 
difference scores, the difference between these 
means was tested for significance. These results 
are presented in Table I. Only a very slight 
difference was found between the mean scores 
obtained by the total sample on the verbal and 
illustration forms of this test. Statistically, this 
difference borders on significance at the five percent 
level of confidence. However, it does appear that 
variation of the mode of presentation of items does 
not greatly influence the total test scores. 

MERENDA - MACALUSO 

[Figure: sample question with sample answer form. 
Item 1: Which is NOT kept in the officers' galley?] 
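
The total-sample comparison rests on difference scores, since each subject took both forms. One common way to test such a mean difference is a paired t statistic; the sketch below assumes that approach, which may not be the exact procedure the authors used. The function name and the five pairs of scores are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (mean difference, t statistic, degrees of freedom) for paired scores."""
    if len(first) != len(second):
        raise ValueError("both score lists must cover the same subjects")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the difference scores
    if sd == 0:
        raise ValueError("difference scores have no variance")
    standard_error = sd / math.sqrt(n)
    return mean(diffs), mean(diffs) / standard_error, n - 1


# Hypothetical scores for five subjects on the two forms.
verbal = [30, 28, 35, 26, 31]
illustration = [31, 30, 35, 27, 33]
mean_diff, t_stat, df = paired_t(verbal, illustration)
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.3f}, df = {df}")
```

Working from difference scores removes the between-subject variance that the two forms share, which is why a small mean difference can still approach significance.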

American Group vs. Filipino Group—In order 
to discover whether or not the mode of presenta- 
tion of items might show different effects for the 
more seriously verbal-handicapped than for the 
lesser linguistically-handicapped, the total group 
was divided into American-born and Philippine- 
born subjects. 

The differences between mean scores on both 
test forms obtained by originally English speaking 
subjects and originally Filipino speaking subjects 
proved to be statistically significant. A further 
inspection of the data in Table II reveals that the 
mean scores and variances obtained by these two 
groups on the verbal and illustration forms are 
practically the same, and that they are very close 
to the total sample values. Here again, it is shown 
that mode of item presentation does not influence 
final test scores. 

Higher Verbal Group vs. Lower Verbal Group— 
A further comparison was made between the scores 
obtained on each form by subjects of higher vs. 
lower verbal ability as measured by the Navy Gen- 
eral Classification Test. Distribution of GCT 
scores of the total sample (N = 107)* yielded a 
mean of 36.67 and a standard deviation of 6.91. 
Q1 and Q3 values were determined and were found 
to be, respectively, 30.8 and 40.4. The high and 
low groups were selected as close to these points 
as possible (31 and 40). Due to skewness in the 
distribution and rounding-off procedures, the 
resulting sub-group N’s were NQ3 = 37 and NQ1 = 27. 














The data in Table III again demonstrate the lack 
of influence of mode of presentation within groups 
of both higher verbal ability and lower verbal 
ability. It is noted, however, as expected, that the 
test score means on the respective forms of the 
subjects within the higher verbal ability group are 
significantly higher than those of the lower verbal 
ability group. Here, again, the hypothesis of no 
difference must be accepted for both groups. 

Study of ‘‘Practice Effect’’—Since the subjects 
included in this sample were scheduled to take 
both forms of this experimental test in one sitting, 
the situation required that ‘‘practice effect’’ be 
controlled to the highest degree possible. Unfor- 
tunately, however, circumstances forced the ad- 
ministrators to alter their original plans of admin- 
istering the forms alternately to all subjects seat- 
ed in a single testing room. Restriction of facili- 
ties both within and outside the testing room ne- 
cessitated the splitting up of the total sample into 
two separate groups seated in two distinct testing 
areas. The sub-groups were randomly selected 
and again, due to physical restrictions, it was 
impossible to equate the groups in size. Group I 
numbered 67 and was administered the verbal 
form first, then the illustration form. Group II 
was composed of 44 subjects, and for them the 
administration procedure was reversed. The groups 
were kept intact and were separated during a short 
break period between administrations. Both were 
told that they would be taking two examinations on 
the same subject but in different forms. Neither 
group was told that the items contained in one form 
were identical to those in the other form. No 
strict time limits were imposed on either group, 
but the administration time for both groups was 
approximately equal. The GCT scores were studied 
for subjects who took the verbal form first and 
those who took the illustration form first. 





*There were four subjects for whom GCT scores were not available. 
The practically identical means (M_D = 0.05) 
show that the random selection of subjects to 
Groups I and II prior to the administration of the 
tests was indeed effective in producing two groups 
equated with respect to general verbal ability. 

To test for the presence of any ‘‘practice ef- 
fect’’ within groups, means and standard devia- 
tions were calculated on each test for those sub- 
jects who took the verbal form first and those who 
took the illustration form first, and the differences 
between means were tested for significance. 

The data in Table V reveal the existence of 
some ‘‘practice effect.’’ It will be noted that for 
each sub-group there was a statistically significant 
increase in the average scores obtained on 
the test of the second administration. This finding 
is as expected since it is practically impossible to 
completely control this effect as a source of error 
variance. It is only hoped that this effect will be 
kept at a minimum, and in this light the fact that 
the differences between mean scores, although 
statistically significant, were not too great was 
encouraging. 

A further study of the ‘‘practice effect’’ was 
made by comparing the results between groups on 
the two forms of the test. Means and standard de- 
viations of verbal form scores and illustration 
form scores were compared between those yield- 
ed by the group taking the verbal form first and 
those taking the verbal form second. 

The data in Table VI reveal evidence of the 
counter-balance influence on ‘‘practice effect’’ by 
the employment of sub-samples alternately ad- 
ministered the two test forms. Although the data 
in Table V show evidence of the existence of ‘‘prac- 
tice effect’’ the fact as revealed in Table VI that 
neither form was more difficult or easier than the 
other on both administrations and that the differ- 
ences in mean performance were in direct pro- 
portion to the order in which taken demonstrates 
the efficiency of the administration procedures in 
minimizing the influence of ‘‘practice effect.’’ 
The data in Table VI also show that there was less 
change in scores from one administration to the 
other with the illustration form (P > .05) than with 
the verbal form (P < .01). 


Reliability and Validity of the Test Forms 





Reliability— The reliability of these forms was 
determined by the Rulon (5) Method applied to par- 
allel forms and correlated samples. The formu- 
la employed for determining this reliability coef- 
ficient is identical to the formula for deriving the 
correlation coefficient between a set of difference 
scores. 


$$ r_{12} = \frac{\sigma_1^2 + \sigma_2^2 - \sigma_d^2}{2\sigma_1\sigma_2} $$





For these tests the reliability coefficient was 
found to be .55. Since the experimental forms 
were each composed of only 50 items whereas the 
standard examinations are comprised of 150 items, 
an estimate of the total test reliability was 
computed according to Gulliksen (2), and was found to 
be equal to .79. This is evidence of reasonably 
high test reliability. 
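
The two steps above can be sketched in a few lines: the correlation between forms recovered from the variances of the two forms and of the difference scores, and the projection of a 50-item reliability to a 150-item test. The projection below assumes the Spearman-Brown formula, which the text does not name explicitly but which reproduces the reported .79 from .55. Function names and the small data set are hypothetical.

```python
import math
from statistics import pvariance


def parallel_forms_r(form_1, form_2):
    """Correlation between two forms, from the variances of each form
    and of the difference scores."""
    diffs = [a - b for a, b in zip(form_1, form_2)]
    var_1, var_2, var_d = pvariance(form_1), pvariance(form_2), pvariance(diffs)
    return (var_1 + var_2 - var_d) / (2 * math.sqrt(var_1 * var_2))


def spearman_brown(r, k):
    """Projected reliability when a test is lengthened by a factor of k."""
    return k * r / (1 + (k - 1) * r)


# Hypothetical scores that are perfectly related, so r should equal 1.
r_demo = parallel_forms_r([1, 2, 3, 4], [2, 4, 6, 8])

# 50 items -> 150 items is a lengthening factor of 3.
projected = spearman_brown(0.55, 150 / 50)
print(f"{projected:.2f}")  # prints 0.79
```

The identity used in `parallel_forms_r` follows from the variance of a difference, which equals the sum of the two variances minus twice the covariance.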

For each test form a K-R #20 reliability 
estimate was obtained. This formula yielded an 
internal consistency coefficient of .47 for the 
illustration form (50 items) and .52 for the verbal form 
(50 items). These statistics, transformed to z’s 
and tested for significance, proved to be 
homogeneous (χ² = 0.19), thus yielding further evidence of 
equal reliability for both forms. The lower 
average value of the Kuder-Richardson coefficient as 
compared to the Rulon coefficient is to be expected 
since dispersion of item difficulties and item 
heterogeneity were present in the tests. These are 
conditions under which the K-R #20 coefficient is 
attenuated. 
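
For readers unfamiliar with the Kuder-Richardson formula 20, a minimal sketch of the computation follows. It uses the standard textbook form of the coefficient with population variances; the function name and the tiny response matrix are hypothetical and are not the study's data.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item responses.

    Each row is one examinee; each column is one item.
    """
    n_subjects = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores have no variance")
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_subjects  # proportion passing item j
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses: four examinees, three items.
demo = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
coefficient = kr20(demo)
print(f"{coefficient:.2f}")  # prints 0.75
```

Because the coefficient depends on item variances relative to total-score variance, a wide spread of item difficulties or heterogeneous content lowers it, as the text notes.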

Validity— An external criterion was available 
for establishing the predictive ability of these tests. 
The validity criterion was the Advancement in 
Rating Examination standard score obtained by 47 
subjects of the total sample of 111 on the August 
1955 Navy-wide competitive Steward’s Mate 
examinations. Both verbal and illustration form scores 
for this sample were individually correlated with 
the criterion scores. In each case, the resulting 
correlation coefficient was .53. This value was 
corrected for increase in length of the total test 
(Kt = 150) by a method described by Gulliksen (2:89), 
and the final validity coefficient for each test 
form was found to be .63. 
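
The length correction can be checked with the usual formula for the validity of a lengthened test. The sketch assumes that formula and assumes the parallel-forms reliability of .55 reported earlier was the reliability entered into it; neither detail is stated in the text, but together they reproduce the reported .63. The function name is hypothetical.

```python
import math


def lengthened_validity(r_xy, r_xx, k):
    """Validity of a test lengthened by a factor of k, given the original
    validity r_xy and the original reliability r_xx."""
    return r_xy * math.sqrt(k) / math.sqrt(1 + (k - 1) * r_xx)


# 50-item validity .53, assumed reliability .55, lengthened to 150 items (k = 3).
corrected = lengthened_validity(0.53, 0.55, 150 / 50)
print(f"{corrected:.2f}")  # prints 0.63
```

Validity rises more slowly than reliability under lengthening because only the predictor, not the criterion, becomes more reliable.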


Item Analysis Data with Interpretations 





Total Sample— The responses selected by the 
subjects to each of the parallel items were subjec- 
ted to an extensive item analysis treatment. The 
first treatment involved the separating of the total 
sample into 25% high and 25% low on each test 
form, and computing difficulty level estimates and 
discrimination indices according to Lawshe (3) for 
each of the parallel items. These item analysis 
data for each set of parallel items were then compared 
by visual inspection for apparent real differences. 
Since these data represented correlated propor- 
tions (repeated measurements on the same sub- 
jects) the conventional statistical test of signifi- 
cance was not appropriate, and therefore could not 
be applied to the data in this form. A rough 
estimate of real differences between parallel items 
based upon this type of subjective judgment 
revealed that less than twenty-five percent of items 
appeared to show differences in difficulty level and 
power of discrimination when presented in a 
different mode. These item analysis data showed that 
the great majority of the items in both forms 
proved to be high positive in discrimination power 
and to possess high internal consistency. 

To test the statistical significance of the differ- 
ence between difficulty levels of the parallel items, 
a method described by McNemar (4:55-59) was 
employed. The total sample of 111 cases was 
randomly reduced in size to 100 in order to facilitate 
the computation of proportion values, and the pro- 
portions passing and failing each parallel item 
were analyzed. 
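
A test for the difference between correlated proportions can be sketched as follows. The sketch uses the familiar form of McNemar's critical ratio, which depends only on the subjects who passed one version of an item and failed the other; whether this matches the exact computation on the cited pages is an assumption. Function names and the sample responses are hypothetical.

```python
import math


def discordant_counts(verbal, illustration):
    """Count subjects who passed only the verbal version (b) and
    only the illustration version (c). Responses are 1 = pass, 0 = fail."""
    b = sum(1 for v, i in zip(verbal, illustration) if v == 1 and i == 0)
    c = sum(1 for v, i in zip(verbal, illustration) if v == 0 and i == 1)
    return b, c


def mcnemar_cr(b, c):
    """Critical ratio for the difference between two correlated proportions."""
    if b + c == 0:
        return 0.0  # no discordant pairs, so no evidence of a difference
    return (b - c) / math.sqrt(b + c)


# Hypothetical responses of six subjects to one parallel item pair.
verbal_item = [1, 1, 0, 1, 0, 1]
illustration_item = [1, 0, 0, 0, 1, 1]
b, c = discordant_counts(verbal_item, illustration_item)
print(b, c, round(mcnemar_cr(b, c), 3))

# With 12 and 4 discordant subjects the ratio is exactly 2.0.
print(mcnemar_cr(12, 4))
```

Subjects who pass both versions or fail both contribute nothing to the ratio, which is why a conventional test for independent proportions would be inappropriate here.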

Of the 50 sets of item pairs (verbal vs. 
illustration presentation), a total of 35 yielded 
differences in difficulty level which were not 
statistically significant. Of the 15 pairs of items in which 
statistically significant differences occurred, nine 
proved to be less difficult in illustration form, 
and six were less difficult in verbal form. One of 
the nine item pairs which appeared to be less 
difficult in illustration form (C.R. = 2.12) proved to 
be almost as easy in verbal form (P_I = .99, P_V = 
.93). For these fifteen items, the following 
specific findings are reported: 


1. Items in which fine discriminations and dis- 
tinctions are required to be made appear to in- 
crease in difficulty when presented in illustration 
form. For example, in both item sets in which 
the examinee is required to discriminate between 
a dessert spoon and a teaspoon, a significant in- 
crease in difficulty of the illustration form was 
noted. 

2. In items which require a close attention to 
detail there also appears to be a corresponding 
increase of difficulty level in illustration form. 
For example, in item sets in which the examinee 
is required to pay close heed to details which are 
somewhat confounded in the reproduction of the 
illustration, there followed a substantial increase 
in the difficulty level of the illustrated item. 

3. In items in which either clearly descriptive 
action or identification of things is involved, there 
appears to be a decrease in the difficulty level of 
the illustrated item. This was evident in eight of 
the fifteen item sets which produced statistically 
significant differences. 


Summary and Conclusions 





To test the hypothesis that completely verbal 
items in achievement type tests tend to reduce 
the validity of measurement for members of low 
verbal ability groups, two experimental test forms 
were developed and administered to a sample of 
111 Navy Stewardsmen, 44 of whom were native- 
born Filipinos and 67 of whom were native-born 
American Negroes. Both test forms contained 
the identical items, each sampling knowledges and 
skills required for promotion to Steward’s Mate, 
Third Class. In one form the mode of item pre- 
sentation was completely verbal, whereas in the 
other form the items were in illustration form 
(verbal stems, illustrated options). 





Both test forms were administered to the same 
subjects, and comparisons of performance were 
made both on a total score basis and for individual 
pairs of parallel items. Both experimental tests 
proved to be parallel forms. Reliabilities and 
validities of each form were also identical. Of the 
50 item sets (pairs), 35 proved to be of equal 
difficulty level; and of the 15 which showed 
statistically significant differences, nine were less difficult 
in illustration form, and six were more difficult. 
Also no significant differences were found to exist 
between the performance on both test forms of 
American-born and Philippine-born subjects; and 
between subjects of higher and lower verbal abili- 
ties. 

From the findings obtained in this study, the 
following conclusions appear to be justified: 


1. Varying the mode of presentation (from com- 
pletely verbal to partially verbal form) of achieve- 
ment type test items does not appreciably affect 
the predictive validity of the measurement when 
applied to members of the low verbal ability group 
studied. 

2. For the major portion of the material upon 
which this type of measurement is based, presen- 
tation in either form produces reasonably the same 
results. 

3. The required increase in time, effort, and 
monetary costs involved in the development and 
construction of large numbers of illustration items 
would not be justified. 
IN DIFFERENT CULTURAL SITUATIONS 
Auburn, Alabama 


The Problem 


IN THE FIELD of perceptual psychology, it is 
sometimes held that behavior is a function of per- 
ception (15). 

Authoritarian behavior has been the subject of 
much research in recent years, and certain psy- 
chologists hold that those persons classed as ‘‘au- 
thoritarians’’ are less able to perceive reality, 
tolerate ambiguity, resist suggestion, and assert 
their independence than are ‘‘nonauthoritarians.’’
Accordingly, if behavior is a function of percep- 
tion, then these behavior characteristics of au- 
thoritarians may be products of the way they per- 
ceive. One purpose of the present study was to 
explore this possibility. 

Further, there is general feeling that an indi- 
vidual’s cultural background exerts considerable 
influence on his personality development. If this 
view is correct, interest attaches to the way in 
which the culture exerts its influence. Is it that 
a particular cultural situation tends to develop in 
the individual certain perceptions, and in this 
manner influences the very fabric of his person- 
ality? In the present study, groups of persons 
from two different cultural backgrounds were in- 
cluded in order to obtain some evidence on this 
particular problem. 

Maslow (13) first outlined the authoritarian 
character in 1943, listing, among other things, 
some of the behavioral characteristics which were 
examined in this particular study. Adorno (1) and
his colleagues devoted considerable time and 
space reporting their studies of authoritarian 
personalities, out of which emerged the F (Fascist) 
scale, a measure of anti-democratic potential. 
Christie and Jahoda (3) examined the entire re-
search procedure involved in the development of
the F scale and seriously questioned some of the
techniques and conclusions of Adorno and his col-
leagues. Titus and Hollander (17) reviewed over
sixty studies which used the F scale in psycholog-
ical research from 1950 to 1955 and concluded
that the concepts of ‘‘personality’’ and ‘‘syn- 
drome’’ were without meaning. They further con- 
cluded that the use of the F scale as a predictive 





instrument was still questionable. Masling (12) 
reviewed four studies of authoritarianism and con- 
cluded that the characterization of the authoritar- 
ian had been overdrawn. Drucker (5) studied col- 
lege students and student nurses and reported that 
conformity was related to authoritarianism. 
Koontz (11) compared white and Negro students 
from two Southern universities and found no sig- 
nificant differences regarding authoritarianism 
between the two ethnic groups. Further, he found 
only nine percent of the total sample were ‘‘true 
authoritarians,’’ and also that intolerance of am-
biguity was probably not highly related to authori-
tarianism. Christie and Garcia (2) compared stu-
dents in California with those in a Southwestern city and
found that students from the Southwestern city had 
higher F scale scores. Davids (4) studied 20 col- 
lege students and found no significant relationship 
between F scale scores and tolerance of either 
ambiguous visual or auditory stimuli. Frenkel- 
Brunswik (6,7) pointed out that prejudiced persons 
were less able to tolerate ambiguity. Maslow (14) 
discussed ‘‘self-actualizing people’’ and reported 
that such persons had clear perceptions of reality, 
could tolerate ambiguous situations, and were 
readily able to assert their independence should 
the situation so arise. Snygg and Combs (15) de- 
scribed an ‘‘adequate personality’’ in similar 
terms. Studies of visual perception of reality re- 
ported by Koffka (10) and Kelley (9) indicated that 
reality derived its characteristics from the total 
situation, including the perceiver, rather than
from the stimuli alone. Finally, Stevens (16) said 
that the role of the listener needed to be studied 
by ‘‘varying the stimulus and/or by varying the
listener.’’

Three major hypotheses were developed in or- 
der to aid in the exploration of the specific aspects 
of the problem. Each of these major hypotheses 
had several related sub-hypotheses. These are: 


1. Persons from a Southern rural cultural situa- 
tion will be more authoritarian than persons 
from a Northern urban cultural situation. That 
is, persons from a Southern rural cultural sit-
uation will:
a) score higher on the F scale
b) be less able to perceive reality
c) be less able to tolerate ambiguity
d) be less able to assert their independence
e) be less able to resist suggestion
f) have fewer perceptions


2. Girls will be more authoritarian than boys. 
That is, girls will: 
a) score higher on the F scale 
b) be less able to perceive reality 
c) be less able to tolerate ambiguity 
d) be less able to assert their independence 
e) be less able to resist suggestion 
f) have fewer perceptions 


3. There will be a pattern of authoritarian behav-
ioral characteristics which will indicate the
existence of a syndrome. Specifically, there
will be a positive relationship between the fol-
lowing factors:
a) F scale scores and suggestibility
b) F scale scores and fewer perceptions
c) Suggestibility and fewer perceptions
d) Perception of reality and assertion of inde-
pendence
e) Tolerance of ambiguity and assertion of in-
dependence
f) Tolerance of ambiguity and perception of
reality


Further, there will be a negative relationship be-
tween the following factors:
g) F scale scores and perception of reality
h) F scale scores and tolerance of ambiguity
i) F scale scores and assertion of independ-
ence
j) Suggestibility and perception of reality
k) Suggestibility and tolerance of ambiguity
l) Suggestibility and assertion of independence
m) Fewer perceptions and perception of re- 
ality 
n) Fewer perceptions and tolerance of ambi- 
guity 
o) Fewer perceptions and assertion of inde- 
pendence 


The Procedure 





The author developed a non-verbal sound test 
(8) after a series of seven pilot studies involving 
486 persons, and then administered this test, to-
gether with tests of hearing acuity and the F scale,
to groups of high-school students in rural Ala- 
bama and urban Michigan. 

The sound test consisted of a series of ques-
tions and thirty-four non-verbal sounds recorded
on magnetic tape. The questions were designed 
to procure specific information (age, sex, grade 
in school, etc.) from each of the students parti-
cipating in the study. Twenty sounds required a





response to an aural stimulus in a situation un-
structured except for time. Fourteen sounds, each
of which was played two times, required re-
sponses in a more highly structured situation. In-
cluded also were two sample sounds designed to
acquaint the students with the general procedure
and style of the test. The tape was played at a con-
stant speed and volume.

Each sound required approximately ten sec- 
onds of playing time and was followed by exactly 
eighteen seconds of silence. After hearing each 
of the first twenty sounds the subjects wrote out in 
a word or short phrase a description of the sound. 
Those sounds which were accurately described,
i.e., the description listed by the student was
‘‘correct,’’ were considered as correct percep-
tions of reality. Those sounds which were not de-
scribed, i.e., the student did not write out any
kind of description in the space provided, were 
considered as ‘‘not perceived.’’ In other words, 
for the purposes of this study unless a particular 
aural stimulus evoked a description of that sound, 
it was considered as not perceived. 

The last fourteen sounds, each of which was 
played two times, required responses in a more
structured situation. That is, each of these sounds
had four alternative descriptions from which the
student could choose, and one of these was sug-
gested as being correct. Actually, none of the al-
ternatives were correct descriptions of the partic-
ular sounds, although the last choice in each ser-
ies was ‘‘something else’’; thus a subject could
respond in that manner if he desired. If a subject, 
after hearing a sound, indicated his selection as 
that one which corresponded with the suggested 
choice, this was scored as indicating suggestibil-
ity. If a student decided that none of the descrip-
tions listed were correct and selected ‘‘something
else’’ as his choice, this was scored as an indica-
tion of his ability to assert his independence. Af-
ter all fourteen sounds had been heard, they were
replayed, and each subject was asked to indicate
his degree of certainty about his initial percep-
tion, and this served as an indication of his ability
to tolerate an ambiguous situation. That is, since 
none of the structured alternatives were correct 
except the ‘‘something else,’’ and this was only 
partially correct since it did not describe the ac- 
tual source of the sound, the degree of certainty a 
subject had about these ambiguous alternatives 
was ascertained. 

For the purposes of this study, then, a percep-
tion of reality represented the correct description
of an aural stimulus. Suggestibility represented 
an individual’s response which corresponded to a 
suggested response for a particular aural stimu- 
lus. Tolerance of ambiguity represented the de- 
gree of certainty an individual had about his ini- 
tial aural perception upon hearing certain sounds 
for a second time. Assertion of independence 
represented an individual’s selection of the re-
sponse ‘‘something else’’ when presented with a
particular aural stimulus and three other alterna-
tive responses, none of which were correct, but
one of which was suggested as being correct. Fin-
ally, those sounds which were not described were
considered as not perceived. 
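
The scoring rules just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Scores` and `score_sheet` and the shape of the input lists are hypothetical, the judgment of whether a written description is ‘‘correct’’ is assumed to come from a human scorer, and the tolerance-of-ambiguity score is left out because the certainty scale is not described here.

```python
from dataclasses import dataclass

SOMETHING_ELSE = "something else"  # the fourth alternative on each structured item


@dataclass
class Scores:
    perception_of_reality: int = 0      # free-response sounds described correctly
    not_perceived: int = 0              # free-response sounds left blank
    suggestibility: int = 0             # structured items answered with the suggested choice
    assertion_of_independence: int = 0  # structured items answered "something else"


def score_sheet(free_responses, judged_correct, choices, suggested):
    """Score one answer sheet (hypothetical data layout).

    free_responses: written descriptions for the unstructured sounds ('' if blank)
    judged_correct: one bool per description, as judged by a scorer
    choices:        the alternative selected for each structured sound
    suggested:      the alternative that was suggested as correct for each sound
    """
    scores = Scores()
    for text, correct in zip(free_responses, judged_correct):
        if not text.strip():
            scores.not_perceived += 1          # no description written
        elif correct:
            scores.perception_of_reality += 1  # accurate description
    for chosen, hint in zip(choices, suggested):
        if chosen == SOMETHING_ELSE:
            scores.assertion_of_independence += 1
        elif chosen == hint:
            scores.suggestibility += 1
    return scores
```

Because a blank response can never be counted as correct, and a subject who picks the suggested choice cannot also pick ‘‘something else,’’ the two pairs of scores are not independent of one another, a property of the test design that bears on the correlations between them.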

The F (Fascist) scale (1), a commonly used 
device in psychological research, was selected as
an instrument to determine an individual’s tend-
ency to be authoritarian. The instrument em-
ployed in this study consisted of 27 items admin-
istered as a ‘‘Public Opinion Questionnaire,’’ the
subjects indicating the extent of their agreement 
or disagreement with the various items. High F 
scale scores indicated a tendency toward author- 
itarianism. 

The final group studied consisted of 22 girls 
and 42 boys from urban Michigan, and 48 girls 
and 43 boys from rural Alabama. The mean age 
of the Michigan sample was 16.31 years, and the 
mean age of the Alabama sample was 16.40 years. 
All subjects were Caucasian and had lived ten or 
more years in their particular locale. Only the 
results for those persons who passed a pure tone 
audiometer screening test were included in this 
report. 

The subjects from Michigan came from a large 
metropolitan area in the southeastern section of 
the state. The school, located centrally in this 
automobile manufacturing city, had approximate- 
ly 2000 students in grades nine through twelve. 
The student body was a heterogeneous group rep-
resenting different races, nationalities, and re-
ligious affiliations, as well as deviations in social
and economic status. Two groups of ninth-grade
and two groups of twelfth-grade students from this 
area were included in this study. 

The subjects from Alabama came from a small,
rural community situated about 55 miles southwest 
of Chattanooga, Tennessee, on top of what was 
commonly known as Sand Mountain. The popula-
tion was exclusively white and the four churches
in the immediate vicinity were all Protestant.
The enrollment of the consolidated school from
which this group came was approximately 750 stu-
dents, grades one through twelve. Ninety-eight
percent of the school population were bus-trans-
ported. The student body was essentially a ho-
mogeneous group, all white, and almost exclu-
sively Protestant in their religious affiliation.
Most of the students lived on small farms, thirty 
to forty acres in size. Two groups of ninth-grade 
and two groups of twelfth-grade students from this 
area were included in this study. 

The statistical significance of the differences between the groups
from these two cultural situations and between
the sexes was tested by computing t for the follow-
ing scores: authoritarianism as depicted by the F
scale, perception of reality, tolerance of ambi- 
guity, assertion of independence, and number of 
sounds not perceived. Further, coefficients of 





correlation (r) between pairs of scores for stu-
dents from both cultural situations, from both
groups combined, for all of the boys, and for all
of the girls combined were also computed.
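
As a rough illustration of the two statistics named above, the sketch below computes a t ratio for two independent groups and a product-moment coefficient of correlation. The pooled-variance form of t is an assumption, since the particular formula used is not stated; the function names are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's t for two independent samples, pooled-variance form (assumed)."""
    na, nb = len(group_a), len(group_b)
    # Pooled estimate of the common variance from the two sample variances.
    sp2 = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(sp2 * (1 / na + 1 / nb))


def pearson_r(x, y):
    """Product-moment coefficient of correlation between paired scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)
```

With the groups reported here, `pooled_t` would be called once per score (for example, the F scale scores of the 91 Alabama and 64 Michigan subjects), and `pearson_r` once per pair of scores within each grouping.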


The Results 


Table I shows a comparison of mean scores of 
the Southern rural and Northern urban group per- 
formances on the F scale, perception of reality, 
tolerance of ambiguity, suggestibility, assertion 
of independence, and sounds not perceived, as 
ascertained in this study. 

In general, persons from the Southern rural 
group scored markedly higher on the F scale 
(mean of 5.20) than did those persons from the 
Northern urban area (mean of 4.02). Further, 
those persons from the Southern rural group also 
had fewer perceptions of reality (mean of 5.74) 
than persons from the Northern urban group (mean 
of 8.47), were more suggestible (Southern rural 
group mean of 7.09, Northern urban group mean 
of 6.59), were less able to assert their independ- 
ence (Southern rural group mean of .34, Northern 
urban group mean of 1.47), and had fewer percep- 
tions, i.e., more sounds not perceived (Southern 
rural group mean of 4.44, Northern urban group 
mean of 2.25). Those persons from the Southern 
rural group, however, were more able to tolerate 
ambiguity (mean of 27.52) than were members of 
the Northern urban group (mean of 25.72), as 
measured in this study. All of these differences 
were statistically significant at the five percent
level or higher except that one between mean scores 
on the suggestibility index. 

Table II shows a comparison of the mean scores 
of boys from these two areas combined and of the 
girls on these same factors. These data indicate 
that there is a tendency among the girls tested in 
this particular study toward a pattern of authori- 
tarian behavior characteristics, although none of 
the mean scores are significantly greater than 
those of the boys. Girls did score higher on the 
tolerance of ambiguity measure, but this differ- 
ence was not statistically significant either. 

Table III shows the coefficients of correlation 
between various pairs of the following scores: F 
scale, tolerance of ambiguity, perception of real- 
ity, assertion of independence, suggestibility, and 
number of sounds not perceived, as measured in 
this study. The correlation coefficients (r) were 
calculated for the Southern rural sample, the 
Northern urban sample, for both groups combined, 
for all of the boys, and for all of the girls. In this 
table those statistics which are underlined once 
( ) represent relationships in the predicted di- 
rection, whereas those which are underlined twice 
(__) represent relationships which are in the di- 
rection predicted and which are statistically sig- 
nificant at the five percent level of confidence. 

There are several observations about Table III.
TABLE I

COMPARISON OF MEAN SCORES OF SOUTHERN RURAL AND NORTHERN URBAN GROUPS
ON SIX BEHAVIORAL CHARACTERISTICS

Characteristic               Northern Urban   Southern Rural   t

F scale scores                4.02             5.20            ...
Perception of reality         8.47             5.74            ...
Tolerance of ambiguity       25.72            27.52            2.09*
Suggestibility                6.59             7.09            1.24
Assertion of independence     1.47              .34            ...
Sounds not perceived          2.25             4.44            ...

* Differences significant at 5% level

TABLE II

COMPARISON OF MEAN SCORES OF BOYS AND GIRLS ON SIX BEHAVIORAL CHARACTERISTICS

Characteristics compared: F scale scores, perception of reality, tolerance
of ambiguity, suggestibility, assertion of independence, and sounds not
perceived.

The first, and perhaps the most important, is the
fact that none of the correlation coefficients are
very high, even though several are statistically 
significant. 

Although these coefficients of correlation are 
not high, there seems to be some consistency or 
pattern of relationships which may be worthy of 
consideration. For instance, in the Southern ru- 
ral group eleven of the fifteen relationships are in 
the direction predicted, and six of these are sta- 
tistically significant. On the other hand, the North- 
ern urban group has only three relationships which 
are statistically significant, and one of these is in
the direction opposite from that which was hypoth- 
esized. 

If both groups are combined and correlation 
coefficients calculated for the total sample of 155 
persons, again the pattern of consistency seems 
to emerge. Those persons who tend toward au- 
thoritarianism as measured by the F scale seemed 
to be less able to perceive reality (r of -.28), less 
able to assert their independence (r of -.30), and 
had fewer perceptions (r of .21). Al-
so, those persons who had fewer perceptions tend-
ed to be more suggestible (r of .26).
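
A coefficient as small as .21 can still be statistically significant with 155 subjects. One standard check, offered here only as a sketch since the method actually used to test the coefficients is not stated, converts r to a t ratio with n - 2 degrees of freedom. The function name is hypothetical, and the critical value of about 1.98 (two-tailed, five percent level, 153 degrees of freedom) is taken from standard t tables.

```python
import math


def t_from_r(r, n):
    """t ratio for testing a correlation coefficient against zero, df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Coefficients reported for the total sample of 155 persons.
for label, r in [("F scale vs. perception of reality", -0.28),
                 ("F scale vs. assertion of independence", -0.30),
                 ("F scale vs. sounds not perceived", 0.21)]:
    print(f"{label}: r = {r:+.2f}, t = {t_from_r(r, 155):+.2f}")
```

For r = .21 the ratio comes to about 2.66, which exceeds 1.98 even though the coefficient accounts for less than five percent of the variance.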

If one examines those coefficients of correla- 
tion between these various factors for all of the 
girls it seems that there is some very definite 
pattern apparent since all fifteen coefficients are 
in the direction predicted, although only four of 
these are statistically significant. The same pat-
tern seems to prevail between pairs of scores for
the boys, and although there are fewer relation- 
ships in the predicted direction, six of these co- 
efficients of correlation are statistically signifi- 
cant. 

It must be mentioned, however, that a major 
portion of the size of the r’s between assertion of 
independence and suggestibility scores and between 
perception of reality and sounds not perceived 
scores is probably due to the fact that the design 
of the test was such that one of either of these 
pairs of scores could, under certain circum- 
stances, cancel out the other. That is, if a per- 
son did not describe a sound, he could not have a 
correct perception of reality, thus this must ac- 
count for at least some of the higher coefficients 
of correlation between perception of reality scores 
and sounds not perceived. Likewise, a person 
who selected the suggested response could not se- 
lect ‘‘something else,’’ and thus assert his inde- 
pendence, as measured in this study; therefore, 
these higher coefficients are probably due partly 
to the structure of the test itself. 


Discussion of the Results 





Persons from the Southern rural cultural situ- 
ation scored higher on the F scale, had fewer 
correct perceptions of reality, were less able to
resist suggestion, and also had fewer perceptions. 





It seems that these data, then, are sufficient to 
accept, at least tentatively, the first hypothesis, 
namely, that persons from a Southern rural cultur-
al situation will be more authoritarian than per-
sons from a Northern urban cultural situation.

And although the girls scored higher on the au- 
thoritarian index, had fewer perceptions of reality, 
were less able to resist suggestion, were less 
able to assert their independence, and had fewer 
perceptions, none of these scores differed signif- 
icantly from those of the boys. These data seem 
insufficient to accept the second hypothesis, there- 
fore, that girls will be more authoritarian than 
boys, even though there does seem to be some ev- 
idence of a pattern of characteristics tending in 
such a direction. 

Of the 75 coefficients of correlation calculated 
to portray the various relationships investigated 
in this particular study, 54 were in the direction 
predicted. Of this number, however, only 26 
were significant statistically, and of these 26, nine
were of a questionable nature (i.e., those between
perception of reality and sounds not perceived and
between suggestibility and assertion of independ- 
ence). Only three coefficients of correlation in 
the direction opposite from that which was hypoth- 
esized were statistically significant, and all of 
these were between pairs of scores portraying the 
relationship of tolerance of ambiguity to number
of sounds not perceived. 

Although the data in this study are insufficient 
to accept the sub-hypotheses that tolerance of am-
biguity is negatively related to certain authoritar-
ian behavioral characteristics, something more
significant may be inherent in these same data. 
Those persons who scored lower on the F scale, 
had more correct perceptions of reality, and who 
had more perceptions also tended to be more cer-
tain (i.e., less tolerant of ambiguity). However,
these persons were also more correct; they were 
more effective in their perceptual processes. If 
they were more certain and less tolerant, they 
were so justifiably. This might be interpreted as 
a form of self-confidence or assurance which 
comes from having accurate perceptions and know- 
ing they are accurate. This would hardly seem to 
be an undesirable characteristic, although it may 
be an authoritarian one. Also, it may be that 
this test did not actually measure tolerance of am-
biguity as purported. The concept of tolerance of
ambiguity apparently needs more study. 

It would appear from these data, then, that 
there is some evidence supporting the hypothesis 
that there is a syndrome of authoritarian behav- 
ioral characteristics, although these data are 
more frequently not statistically significant than 
otherwise. There is an unmistakable pattern evi-
dent, but even though it is apparent, it is slight.
Thus, the idea that there is an interrelationship 
of characteristics which would indicate the exis-
tence of a syndrome seems more plausible log-
ically than tenable experimentally, at least as
measured in this study. The characterization of 
the authoritarian personality as it has been dis- 
cussed in other research seems to exist, but in a 
very indefinite and incomplete sense. It would 
seem that perhaps this theoretical characteriza- 
tion has been overdrawn, especially as regards 
intolerance of ambiguity. 


Conclusions 


Persons from different cultural situations 
seem to perceive the same sounds differently. * 
Those persons from the Northern urban area who 
participated in the present study apparently had 
more perceptions and were more accurate in their 
perceptions. 

Further, it seems that authoritarians are less 
‘‘open’’ in the way they perceive aural stimuli. In 
other words, since they are less accurate, appar-
ently have fewer perceptions, and are less able to
assert their independence, among other things, 
those persons who are authoritarians do not seem 
to be as effective in the perceptive process as are 
nonauthoritarians. 

Conversely, the nonauthoritarians seem to have 
more perceptions, to be more accurate in perceiv- 
ing aural stimuli, and assert their independence 
more effectively than authoritarians. Because of 
these specific differences in perceptual abilities, 
it appears that nonauthoritarians are more ‘‘open’’ 
to aural stimuli than authoritarians. 


Suggested Research 





During the course of the present study several 
related problems emerged. Some which the author 
feels deserve investigation are described: 


1. How do persons from Northern rural and South- 
ern urban cultural situations perceive aural 
stimuli? 

2. What is the relationship of tolerance of ambi-
guity to authoritarianism?

3. Are the ‘‘authoritarian teacher’’ and the ‘‘au-
thoritarian personality’’ the same?

4. Is there a difference between ‘‘direct’’ author-
itarianism and ‘‘manipulation’’?
GROUP STRUCTURE, ANXIETY, AND PROB-
LEM-SOLVING EFFICIENCY


FRANK W. BANGHART 
University of Virginia 


THE PURPOSE of this study was to compare 
the relative differences regarding the influence of 
anxiety on problem-solving, between cooperative 
and non-cooperative groups. 


Background 


There is evidence in the literature to suggest 
that anxiety level does influence learning and per- 
formance. Castaneda and others (4) infer that 
‘‘the tendency for the high anxious subjects to per-
form more poorly in comparison to low anxious
subjects increases as the difficulty of the task in-
creases.’’ The difference found by Castaneda et
al. was statistically significant at the .025 level.
The difference was based upon the high anxious
child doing more poorly on difficult tasks than
did the low anxious child. However, the high anx-
ious child apparently did no better than the low
anxious child on less difficult tasks. Palermo et
al. (6) studied the relationship between anxiety
and performance on a complex learning task and
reported that the non-anxious subjects were supe-
rior to the anxious subjects in every block of trials.
The suggestion has been made in the Palermo et
al. study that ‘‘the effect of increases in motiva-
tion on performance depends upon the relative
strength of the correct and incorrect responses
aroused by the experimental situation.’’ Calvin
et al. (3) report no relationship between anxiety
and intelligence, but do report that anxiety does 
significantly inhibit performance. Sarasen (7) in 
a study of the effects of anxiety on two kinds of 
failure on social learning reported no significant 
difference between the control and experimental 
groups. Yet Flanders (5) concludes that ‘‘student 
behavior associated with interpersonal anxiety 
takes priority over behavior oriented toward the 
achievement problem.’’ Ausubel et al. (1), re-
porting on a study dealing with anxiety and the
learning process suggest that ‘‘the low anxiety 
group was significantly superior to the high anx- 
iety group on the first trial of the maze, but this 
superiority was not maintained over the course of 








ten trials.’’

The above studies suggest that there are fac- 
tors other than anxiety involved in situations where 
anxiety is a significant influence. In other words, 
anxiety per se is not necessarily the sole deter- 
mining factor, but, rather, it becomes a signifi-
cant influence under certain circumstances. It was
the purpose of this study to investigate the rela- 
tive influence of anxiety when groups are arranged 
cooperatively and non-cooperatively. 


Procedure 


Twenty-four university students were given the 
Taylor Anxiety Scale. The subjects were then 
assigned to either a cooperative or non-coopera-
tive experimental group. The basic difference be-
tween the two groups was the free exchange of
ideas and information on the part of the coopera-
tive group. No deliberate communication was
permitted for the non-cooperative group. 

The experimental apparatus [described in de-
tail elsewhere (2)] consisted of ten colored lights
arranged on a panel. The problem involved the 
subjects’ prediction of which light would come on 
next. The light series were predetermined and 
automatically controlled. The total experimental 
situation consisted of three problems, the first of
which might be considered a ‘‘warm-up’’ task.
Problem two consisted of a simple numerical so-
lution (1,3,5,7,9); problem three required a
place-color solution (red-red; green-green; etc.)
and involved a shift from a numerical to a color
set. In terms of solution time and problem-solv-
ing efficiency, problem three was approximately
fifty percent more difficult than problem two.


Results 


Table I summarizes the data in terms of prob- 
lem-solving time and anxiety. Correlations 
were run between problem-solving time and anx- 
iety for problems two and three for cooperative 
and non-cooperative groups. Table I suggests 


*This study was made possible by a research grant from the Group Psychology Branch, Office of Naval 
Research, Contract No. NONR 474 (8). 
TABLE I

CORRELATION BETWEEN PROBLEM-SOLVING TIME
AND ANXIETY

Group                       r       p

Non-Cooperative (P 2)      .06     ...
Cooperative (P 2)          .10     ...

Non-Cooperative (P 3)      .04     ...
Cooperative (P 3)          .55     ...

P 2 = Problem two
P 3 = Problem three
p = probability of derived correlation coefficient


TABLE II

CORRELATION BETWEEN PROBLEM-SOLVING EFFICIENCY
AND ANXIETY

Group                       r       p

Non-Cooperative (P 2)     -.35     ...
Cooperative (P 2)         -.45     ...

Non-Cooperative (P 3)      .01     ...
Cooperative (P 3)         -.48     ...

P 2 = Problem two
P 3 = Problem three
p = probability of derived correlation coefficient
that there was very little correlation (.06 and .10)
between problem-solving time and anxiety for either the non-
cooperative or cooperative groups on problem two.
Since problem two can be considered to be an
‘‘easy’’ problem, the results are consistent with
the literature on the subject. However, our inter-
est in this study was in the relative influence of
anxiety between cooperative and non-cooperative
groups. On the basis of this study no difference
can be assumed (p = .448).
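
The test used to compare the two correlation coefficients is not named. One common choice for two independent groups is Fisher's r-to-z transformation, sketched below as an assumption rather than as the procedure actually followed. The function name is hypothetical, and the group sizes in the usage lines are illustrative only, since the number of subjects in each group is not reported.

```python
import math
from statistics import NormalDist


def fisher_z_difference(r1, n1, r2, n2):
    """Compare two independent correlations via Fisher's r-to-z transformation.

    Returns the normal deviate z and its one-tailed probability.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)       # r-to-z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # standard error of z2 - z1
    z = (z2 - z1) / se
    p_one_tailed = 1 - NormalDist().cdf(abs(z))
    return z, p_one_tailed


# Illustrative call: the r values are from the text, the group sizes are assumed.
z, p = fisher_z_difference(0.06, 12, 0.10, 12)
print(f"z = {z:.2f}, one-tailed p = {p:.3f}")
```

Because the probability depends heavily on the group sizes, this sketch should not be expected to reproduce the reported p values without knowing how many subjects entered each correlation.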


For problem three, which was fifty percent 
more difficult than problem two, an interesting 
discrepancy exists between the two groups (r = .04 
for non-cooperative groups; r = .55 for coopera- 
tive groups, p = .047). One possible explanation 
of the discrepancy between the two groups on prob-
lem three might involve the thinking of the Flan-
ders article (5). The suggestion is that the non-
cooperative groups were problem (achievement)
centered, whereas the cooperative groups were
interpersonal centered; the assumption is that
anxiety interferes more when interpersonal rela-
tions are involved. This is speculation, of course,
and needs to be checked empirically.


Table II summarizes the data dealing with the
relationship between problem-solving efficiency
and anxiety. Efficiency in this case is defined as 
the ratio of correct to total responses. 

Again, the correlations were run for problems 
two and three for cooperative and non-cooperative 
groups. 


Reference to Table II suggests some interfer-
ence (r = -.35 and -.45) by anxiety with problem-
solving efficiency. However, the relative influ-
ence does not seem to differ significantly between
the two groups (p = .352). (Reference might be
made to the relative influence of anxiety on prob-
lem-solving efficiency and problem-solving time
for problem two.)


For problem three a difference again emerges 
(p = .068) between the two groups. Attention is 
called to the decrease of anxiety influence, for the
non-cooperative groups, from problem two (easy)
to problem three (hard). Again, any attempt to ex-
plain the discrepancies between the two groups is
in the realm of conjecture, yet attention should at 
least be called to other studies which might fur-
nish a plausible explanation. A combination of
the suggestions made by the Flanders (5) study
and the Ausubel (1) study, that the achievement
centered subjects are less influenced by anxiety
than are interpersonal relations centered subjects
(Flanders), and that the discrepancy between high
and low anxiety subjects does not hold up with
time (Ausubel), might account for the findings of
Tables I and II. Certainly some follow-up of these 
suggestions is indicated. 





Summary 


An experimental study was made, using twenty-
four university students, on the relative difference
regarding the influence of anxiety on problem- 
solving between subjects assigned to cooperative 
and non-cooperative groups. Correlations were 
made between problem-solving time and anxiety 
and between problem-solving efficiency and anx-
iety. Tests were made regarding the significance
of the differences between cooperative and non- 
cooperative groups. 

The results suggest a minimal influence (r = 
.06 and .10) of anxiety on problem-solving time 
for the ‘‘easy’’ problem for both cooperative and 
non-cooperative groups (p = .448). However, anx- 
iety seemed to have a more pronounced influence 
on the cooperative group than on the non-cooper-
ative groups for the ‘‘hard’’ problem (r = .04 and
.55; p = .047).

In terms of efficiency, anxiety seemed to have 
about the same influence for both groups (r = -.35
and -.45; p = .352) on the ‘‘easy’’ problem.

However, on the ‘‘hard’’ problem, anxiety 
seemed to be more influential with the cooperative 
group than with the non-cooperative group in 
terms of influencing efficiency (r = .01 and -.48;
p = .068).

At most, the results reported in this paper 
should be considered as exploratory and tentative. 
Certainly much additional work needs to be done 
in this area. A word of caution should be added 
also regarding the technique of measuring efficien- 
cy by a gross device such as right responses to 
total responses. Presumably, one could get a
better measurement of anxiety if a greater dis-
parity existed between high and low anxiety scores.
The present range (6 to 32, M = 13), with possi- 
bly too many scores around the mean, may not 
have been sensitive enough to discriminate between 
high and low anxiety. 

Again, it should be emphasized that more work 
needs to be done before anything beyond sugges- 
tion can be inferred. 